Preface


Although the primary objectives and basic plan of this text have remained 
the same as in the first edition, the present revision places relatively more 
emphasis on principles of psychological testing. This has been accomplished 
both by the expansion of Part 1 and by the frequent discussion of special issues 
and methodological or interpretive problems throughout the text. Even more 
than in the earlier edition, the book is designed to teach the student how to
evaluate psychological tests and interpret test results. The specific tests
discussed in Parts 2, 3, and 4 provide continuing opportunities for illustrating and
applying the principles introduced in Part 1.

The pace at which psychological testing is developing can be gauged from 
the fact that about a third of the tests discussed in this edition have either 
originated or been revised since the publication of the first edition. For many 
of the others, there are revised manuals, technical supplements, or major research
publications that provide important new information. Little of what was
said about specific tests in 1954 has remained unchanged in 1961. 

In such a field, one must be ready to judge the merits of new tests as they 
appear. To enable the reader to do this has been a constant aim of the book. In 
line with this goal, Chapter 2 provides an expanded discussion of sources of in- 
formation about tests. Bibliographies at the end of each chapter include many 
references to publications about specific tests, as well as to more advanced or
specialized readings on each topic. In this connection it should be noted that 
the practice of citing references by number has been retained in the present 
edition. The citation by author and date, currently recommended by the APA 
Council of Editors for journal use, appears to be too unwieldy for textbook 
purposes, especially in an area where many references may have to be cited for 
a single statement. Under these conditions, the page would become cluttered
with names and dates of little intrinsic interest to the college student. Through 
the unobtrusive bibliography numbers given in parentheses, any desired refer- 
ence can easily be located at the end of the chapter. 

As a further aid to the evaluation of tests and the interpretation of test
scores, new sections on elementary statistical concepts have been included in 
Chapters 4 and 5. The computation of the most common measures is illustrated 
with simple examples. Special attention has also been given to a clarification 
and systematization of the various procedures for determining test reliability 
and validity. The treatment of validity has been expanded to cover two chap- 
ters, while the chapter on item analysis from the earlier edition has been con- 
densed and integrated with other topics. More detailed treatment has also been 
given to the interpretation of aptitude profiles, the identification of traits 
through factor analysis, and the evaluation of the various multiple aptitude bat- 
teries developed within the past decade. 

A sample of new topics covered in the present edition includes: research on 
test anxiety, on coaching, and on the influence of examiner variables on test 
performance (Ch. 3); clinical versus statistical utilization of test results (Ch. 
7); decision theory in the interpretation of test scores (Ch. 7); research on the 
measurement of creativity (Ch. 15); the changing relation between achieve- 
ment and aptitude tests (Ch. 16 and 17); procedures for improving teacher- 
made classroom tests (Ch. 16); the social desirability variable in personality 
inventories (Ch. 18); the place of interests in personality theory (Ch. 19); a 
re-evaluation of projective techniques in the light of recent, well-controlled 
studies (Ch. 20); and an overview of many new approaches to personality 
assessment (Ch. 21). 

In a real sense, this book has evolved in the classroom. Much of the discussion
centers around recurrent questions that students ask about tests. Although
intended primarily as a college text, such a book should also prove helpful to 
the practitioner in a number of fields. It provides a comprehensive view of cur- 
rent tests and testing problems for anyone who uses tests, such as the counselor, 
school psychologist, personnel psychologist in industry or government, and
clinical psychologist. Among the parts of special relevance to clinical psy- 
chology, for example, may be mentioned the chapters on individual intelligence 
tests, projective techniques, and other personality measures. The book should 
likewise aid in the proper understanding and interpretation of test scores on the 
part of teachers, principals, social workers, psychiatrists, and others who utilize 
the results of tests in their daily activities. Educators will be particularly in- 
terested in the two chapters on achievement tests, as well as in the several chap- 
ters on group intelligence and aptitude tests. For the general psychologist, the 
book furnishes a background for the critical evaluation of the growing body of 
research data obtained through psychological tests. 

It is a pleasant task to acknowledge the cooperation of colleagues in the 
preparation of this book. I wish to express my sincere appreciation to the 
many authors and test publishers who provided photographs of test materials,
specimen tests, manuals, reprints, and unpublished manuscripts. In each of 
these cases, specific acknowledgment has been made in the text. I am especially 
grateful for the promptness, courtesy, and thoroughness with which my in- 
numerable questions were answered by mail and telephone. Thanks are also 
extended for the thoughtful recommendations submitted by course instructors, 
several of which were utilized in preparing this revision. I am happy to record 
the contribution of my husband, Dr. John P. Foley, Jr., who discussed and 
helped to solve countless problems as they arose throughout the preparation 
of the manuscript. To my colleague, Dr. Dorothea McCarthy, I am indebted
for her helpful suggestions and her ready cooperation in many ways. Grateful 
acknowledgment is made to Miss Rowena Plant and Miss Margaret Tighe of 
the Fordham University library staff for their gracious and efficient assistance 
in bibliographic matters and to Miss Mary Ellen Anderson for her competent 
help in proofreading and indexing. 

A. A. 
New York City 
PART 1


Principles of Psychological Testing 


CHAPTER 1


Functions and Origins of Psychological Testing


Anyone reading this book today could undoubtedly illustrate what is meant 
by a psychological test. It would be easy enough to recall a test the reader 
himself has taken in school, in college, in the armed services, in the coun- 
seling center, or in the personnel office. Or perhaps the reader has served as a 
subject in an experiment in which standardized tests were employed. This 
would certainly not have been the case fifty years ago. Psychological testing 
is a relatively young branch of one of the youngest of the sciences. 


CURRENT USES OF PSYCHOLOGICAL TESTS 


Basically, the function of psychological tests is to measure differences be- 
tween individuals or between the reactions of the same individual on differ- 
ent occasions. One of the first problems that stimulated the development of 
psychological tests was the identification of the feebleminded. To this day, 
the detection of intellectual deficiency remains an important application of 
certain types of psychological tests. Related clinical uses of tests include the 
examination of the emotionally maladjusted, the delinquent, and other types 
of subnormal deviants. A strong impetus to the early development of tests 
was likewise provided by problems arising in education. At present, schools 
are among the largest test users. The classification of children with reference 
to their ability to profit from different types of school instruction, the identifi- 
cation of the intellectually retarded on the one hand and the gifted on the 
other, the diagnosis of academic failures, the educational and vocational 
counseling of high school and college students, and the selection of applicants 
for professional and other special schools are some of the many educational 
uses to which tests are being put. In a somewhat different setting, the testing 
of children for adoption placement illustrates another specific way in which
tests aid in practical decisions. 

The selection and classification of industrial personnel represent relatively 
recent and rapidly expanding applications of psychological testing. From 
the assembly-line operator or filing clerk to top management, there is scarcely 
a type of job for which some kind of psychological test has not proved helpful
in such matters as hiring, job assignment, transfer, promotion, or termination.
To be sure, the effective employment of tests in many of these situations,
especially in connection with high-level jobs, usually requires that the tests be 
used as an adjunct to skillful interviewing, so that test scores may be properly 
interpreted in the light of other background information about the individual. 
Nevertheless, testing constitutes an important part of the total personnel pro- 
gram. A closely related application of psychological testing is to be found in 
the selection and classification of military personnel. From simple beginnings 
in World War I, the scope and variety of psychological tests employed in 
military situations showed a phenomenal increase during World War II. Subsequently,
research on test development has been continuing on a large scale
in all branches of the armed services. 

It is clearly evident that psychological tests are currently being employed 
in the solution of a wide range of practical problems. One should not, how- 
ever, lose sight of the fact that such tests are also serving important func- 
tions in basic research. Nearly all problems in differential psychology, for 
example, require testing procedures as a means of gathering data. As illustra- 
tions, reference may be made to studies on the nature and extent of individ- 
ual differences, the identification of psychological traits, the measurement of 
group differences, and the investigation of biological and cultural factors as- 
sociated with behavioral differences. For all such areas of research—and 
for many others—the precise measurement of individual differences made 
possible by well-constructed tests is an essential prerequisite. Similarly,
psychological tests provide standardized tools for investigating such varied
problems as age changes within the individual, the effects of education, the
outcome of psychotherapy, the impact of propaganda, and the influence of
distraction on performance. 

From the many different uses of psychological tests, it follows that some 
knowledge of such tests is needed for an adequate understanding of most 
fields of contemporary psychology. It is primarily with this end in view that 
the present book has been prepared. The book is not designed to make the
individual either a skilled examiner and test administrator, or an expert on
test construction. It is directed, not to the test specialist, but to the general 
student of psychology. Some acquaintance with the leading current tests is 
necessary in order to understand references to the use of such tests in the
psychological literature. And a proper evaluation and interpretation of test 
results must ultimately rest upon a knowledge of how the tests were constructed,
what they can be expected to accomplish, and what are their peculiar
limitations. Today a familiarity with tests is required, not only by those
who give or construct tests, but by the general psychologist as well. 

A brief overview of the historical antecedents and origins of psychological 
testing will provide perspective and should aid in the understanding of
present-day tests.¹ The direction in which contemporary psychological testing has
been progressing can be clarified when considered in the light of the precursors
of such tests. The special limitations as well as the advantages that
characterize current tests likewise become more intelligible when viewed
against the background in which they originated.


EARLY INTEREST IN THE CLASSIFICATION AND 
TRAINING OF THE FEEBLEMINDED 


The nineteenth century witnessed a strong awakening of interest in the 
humane treatment of the feebleminded and the insane. Prior to that time, neg- 
lect, ridicule, and even torture had been the common lot of these unfortu- 
nates. With the growing concern for the proper care of mental deviates came 
a realization that some uniform criteria for identifying and classifying these
cases were required. The establishment of many special institutions for the 
care of the feebleminded in both Europe and America made the need for 
setting up admission standards and an objective system of classification espe- 
cially urgent. First it was necessary to differentiate between the insane 
and the feebleminded. The former manifested emotional disorders which 
might or might not be accompanied by intellectual deterioration from an ini- 
tially normal level; the latter were characterized essentially by intellectual de- 
fect which had been present from birth or early infancy. What is probably 
the first explicit statement of this distinction is to be found in a two-volume 
work published in 1838 by the French physician, Esquirol (9), in which 
over one hundred pages are devoted to feeblemindedness. 

Esquirol also pointed out that there are many degrees of feebleminded- 
ness, varying along a continuum from normality to low-grade idiocy. In the 
effort to develop some system for classifying the different degrees and varieties 
of feeblemindedness, Esquirol tried several procedures, but concluded that 
the individual's use of language provides the most dependable criterion of
his intellectual level. On this basis, he distinguished between two grades of
imbecility and three grades of idiocy. In the higher degree of imbecility, he
maintained, speech is employed readily and easily; in the lower-grade imbecile,
speech is more difficult and the vocabulary more limited. The highest
grade of idiot uses only a few words or very short phrases; the second level of
idiot is able to utter only monosyllables and cries; and in the lowest-level
idiot, no language is found at all (9, vol. I, p. 340). It is interesting to note
that current criteria of feeblemindedness are also largely linguistic, and that
present-day intelligence tests are heavily loaded with verbal content. The
important part verbal ability plays in our concept of intelligence will be
repeatedly demonstrated in subsequent chapters.

¹ A detailed account of the early origins of psychological tests can be found in Goodenough
(12) and Peterson (26). Cf. also Boring (6) and Murphy (23) for more general background,
and Anastasi (1, Ch. 1) for historical antecedents of the study of individual differences.

Of special significance are the contributions of another French physician,
Seguin, who pioneered in the training of the feebleminded. Having rejected 
the prevalent notion of the "incurability" of mental deficiency, Seguin experi- 
mented for many years with what he designated the "physiological method" 
of training (cf. 28), and in 1837 he established the first school devoted to the 
education of mentally defective children. In 1848 he emigrated to America, 
where his ideas gained wide recognition. Many of the "sense training" and 
"muscle training" techniques currently in use in institutions for the feeble- 
minded were originated by Seguin. By these methods, low-grade mental de- 
fectives are given intensive exercise in sensory discrimination and in the de- 
velopment of motor control. Some of the procedures developed by Seguin 
for this purpose were eventually incorporated into “performance” or nonverbal
tests of intelligence. An example is the Seguin Form Board, in which
the individual is required to insert variously shaped blocks into the
corresponding recesses as quickly as possible.


THE FIRST EXPERIMENTAL PSYCHOLOGISTS 


The early experimental psychologists of the nineteenth century were not, 
in general, concerned with the measurement of individual differences. The 
principal aim of psychologists of that period was the formulation of general- 
ized descriptions of human behavior. It was the uniformities rather than the 
differences in behavior that were the focus of attention. Individual differences 
were either ignored or were accepted as a necessary evil which limited the 
applicability of the generalizations. Thus the fact that one individual reacted 
differently from another when observed under identical conditions was re- 
garded as a form of "error." The presence of such error, or individual vari- 
ability, rendered the generalizations approximate rather than exact. This was 
the attitude toward individual differences that prevailed in such laboratories 
as that founded by Wundt at Leipzig in 1879, where many of the early
experimental psychologists received their training.

In their choice of topics, as in many other phases of their work, the founders
of experimental psychology reflected the influence of their backgrounds
in physiology and physics. The problems studied in their laboratories were
concerned largely with sensitivity to visual, auditory, and other sensory
stimuli, and with simple reaction time. Such an emphasis upon sensory
phenomena was in turn reflected in the nature of the first psychological tests, as will
be apparent in subsequent sections.

Still another way in which nineteenth-century experimental psychology 
influenced the course of the testing movement may be noted. The early psy- 
chological experiments brought out the need for rigorous control of the condi- 
tions under which observations were made. For example, the wording of 
directions given to the subject in a reaction-time experiment might appreciably
increase or decrease the speed with which the subject responded. Or
again, the brightness or color of the surrounding field would markedly alter 
the appearance of a visual stimulus. The importance of making observations 
on all subjects under standardized conditions was thus vividly demonstrated.
Such standardization of procedure eventually became one of the special
earmarks of psychological tests.


THE CONTRIBUTIONS OF FRANCIS GALTON 


It was the English biologist, Sir Francis Galton, who was primarily
responsible for launching the testing movement on its course. A unifying factor
in Galton's numerous and varied research activities was his interest in human 
heredity. In the course of his investigations on heredity, Galton realized the 
need for measuring the characteristics of related and unrelated persons. Only 
in this way could he discover, for example, the exact degree of resemblance 
between parents and offspring, brothers and sisters, cousins, or twins. With
such an end in view, Galton was instrumental in inducing a number of
educational institutions to keep systematic anthropometric records on their
students. In 1882, he established an anthropometric laboratory in South
Kensington Museum, London, where by the payment of a small fee individuals
could be measured in certain physical traits and could undergo tests of
keenness of vision and hearing, muscular strength, reaction time, and other simple
sensorimotor functions. By such methods, the first large, systematic body of
data on individual differences in simple psychological processes was gradually 
accumulated. 

Galton himself devised most of the simple tests administered at his
anthropometric laboratory, many of which are still familiar either in their
original or in modified forms. Examples include the “Galton bar” for visual
discrimination of length, the “Galton whistle” for determining the highest audible
pitch, and graduated series of weights for measuring kinaesthetic
discrimination, as well as tests of strength, speed of reaction, and other traits. It was
Galton’s belief that tests of sensory discrimination could serve as a means of 
gauging a person’s intellect. In this respect, he was partly influenced by the 
theories of Locke. Thus Galton wrote: “The only information that reaches 
us concerning outward events appears to pass through the avenue of our 
senses; and the more perceptive the senses are of difference, the larger is the 
field upon which our judgment and intelligence can act” (10, p. 27). Galton
had also noted that idiots tend to be defective in the ability to discriminate 
heat, cold, and pain—an observation that further strengthened his conviction 
that sensory discriminative capacity “would on the whole be highest among
the intellectually ablest” (10, p. 29).

Galton also pioneered in the application of rating scale and questionnaire 
methods, as well as in the use of the “free association” technique
subsequently employed for a wide variety of purposes. A further contribution of
Galton is to be found in his development of statistical methods for the 
analysis of data on individual differences. Galton selected and adapted a 
number of techniques previously derived by mathematicians. These tech- 
niques he put in such form as to permit their use by the mathematically un- 
trained investigator who might wish to treat test results quantitatively. He 
thereby extended enormously the application of statistical procedures to the 
analysis of test data. This phase of Galton's work has been carried forward 
by many of his students, the most eminent of whom was Karl Pearson. 


CATTELL AND THE EARLY "MENTAL TESTS"


An especially prominent position in the development of psychological test- 
ing is occupied by the American psychologist, James McKeen Cattell. The 
newly established science of experimental psychology and the still newer
testing movement merged in Cattell's work. For his doctorate at Leipzig, he
completed a dissertation on individual differences in reaction time, despite
Wundt’s resistance to this type of investigation. While lecturing at Cambridge 
in 1888, Cattell’s own interest in the measurement of individual differences 
was reinforced by contact with Galton. Upon his return to America, Cattell 
was active both in the establishment of laboratories for experimental psy- 
chology and in the spread of the testing movement. 

In an article written by Cattell in 1890 (7), the term “mental test” was
used for the first time in the psychological literature. This article described a
series of tests which were being administered annually to college students 
in the effort to determine their intellectual level. The tests, which had to be
administered individually, included measures of muscular strength, speed of
movement, sensitivity to pain, keenness of vision and of hearing, weight
discrimination, reaction time, memory, and the like. In his choice of tests,
Cattell shared Galton's view that a measure of intellectual functions could be
obtained through tests of sensory discrimination and reaction time. Cattell's
preference for such tests was also bolstered by the fact that simple functions 
could be measured with precision and accuracy, whereas the development 
of objective measures for the more complex functions appeared at that time 
a well-nigh hopeless task.

Cattell's tests were typical of those to be found in a number of test series 
developed during the last decade of the nineteenth century. Some efforts to 
tap more complex psychological functions may be seen in the inclusion of 
tests of reading, verbal association, memory, and simple arithmetic (22, 29).
Such test series were administered to school children, college students, and 
miscellaneous adults. At the Columbian Exposition held in Chicago in 1893, 
Jastrow set up an exhibit at which visitors were invited to take tests of 
sensory, motor, and simple perceptual processes and to compare their skill 
with the norms (cf. 26, 27). A few attempts to evaluate such early tests 
yielded very discouraging results. The individual's performance showed little 
correspondence from one test to another (29, 37), and it exhibited little or no 
relation to independent estimates of intellectual level based on teachers’ rat- 
ings (5, 11) or academic grades (37). 

A number of test series assembled by European psychologists of the period 
tended to cover somewhat more complex functions. Kraepelin (20), who 
was interested primarily in the clinical examination of psychiatric patients, 
prepared a long series of tests to measure what he regarded as basic factors 
in the characterization of an individual. The tests, employing chiefly simple 
arithmetic operations, were designed to measure practice effects, memory, 
and susceptibility to fatigue and to distraction. A few years earlier, Oehrn 
(24), a pupil of Kraepelin, had employed tests of perception, memory, asso- 
ciation, and motor functions in an investigation on the interrelationships of 
psychological functions. Another German psychologist, Ebbinghaus (8),
administered tests of arithmetic computation, memory span, and sentence
completion to school children. The most complex of the three tests, sentence
completion, was the only one that showed a clear correspondence with the
children’s scholastic achievement. 

Like Kraepelin, the Italian psychologist, Ferrari, and his students were
interested primarily in the use of tests with pathological cases (13). The test
series they devised ranged from physiological measures and motor tests to 
apprehension span and the interpretation of pictures. In an article published 
in France in 1895, Binet and Henri (3) criticized most of the available test 
series as being too largely sensory and as concentrating unduly on simple, 
specialized abilities. They argued further that, in the measurement of the 
more complex functions, great precision is not necessary, since individual 
differences are larger in these functions. An extensive and varied list of tests 
was proposed, covering such functions as memory, imagination, attention, 
comprehension, suggestibility, aesthetic appreciation, and many others. In 
these tests, we can readily recognize the trends that were eventually to lead 
to the development of the famous Binet "intelligence scales." 


BINET AND THE RISE OF INTELLIGENCE TESTS 


Binet and his co-workers devoted many years to active and ingenious re- 
search on ways of measuring intelligence. Many approaches were tried, in- 
cluding even the measurement of physical traits, handwriting analysis, and 
palmistry! The results, however, led to a growing conviction that the direct, 
even though crude, measurement of complex intellectual functions was the 
best solution. Then a specific situation arose which brought Binet's efforts to 
immediate practical fruition. In 1904, the Minister of Public Instruction ap- 
pointed a commission to study procedures for the education of subnormal 
children attending the Paris schools. It was to meet this practical demand 
that Binet, in collaboration with Simon, prepared the first Binet-Simon Scale
(4). 

This scale, known as the 1905 Scale, consisted of 30 problems or tests ar- 
ranged in ascending order of difficulty. The difficulty level was determined 
empirically by administering the tests to 50 normal children aged 3 to 11 
years, and to some retarded and feebleminded children. The tests were
designed to cover a wide variety of functions, with special emphasis upon
judgment, comprehension, and reasoning, which Binet regarded as essential
components of intelligence. Although sensory and perceptual tests were included,
a much greater proportion of verbal content was found in this scale than in 
most test series of the time. The 1905 Scale was presented as a preliminary 
and tentative instrument, and no precise objective method for arriving at a
total score was formulated. 

In the second, or 1908 Scale, the number of tests was increased, some un- 
satisfactory tests from the earlier scale were eliminated, and all tests were 
grouped into age levels. Thus in the 3-year level were placed all tests normal 
3-year-olds could pass, in the 4-year level all tests passed by normal
4-year-olds, and so on to age 13. The child's score on the test could then be
expressed as a “mental age,” i.e., the age of normal children whose performance
he equaled. The use of such mental age norms, which achieved considerable 
popularity in later stages of psychological testing, will be discussed in more 
detail in Chapter 4. Since “mental age” is such a simple concept to grasp, its
introduction undoubtedly did much to popularize intelligence testing. 

A third revision appeared in 1911, the year of Binet's untimely death. In 
this scale, no fundamental changes were introduced. Minor revisions and re- 
locations of specific tests were instituted. More tests were added at several 
year levels, and the scale was extended to the adult level. 

Even prior to the 1908 revision, the Binet-Simon tests attracted wide
attention among psychologists throughout the world. Translations and
adaptations appeared in many languages. In America, a number of different
revisions were prepared, the most famous of which is the one developed under
the direction of L. M. Terman at Stanford University, and known as the 
Stanford-Binet (34). It was in this test that the Intelligence Quotient (IQ), 
or ratio between mental age and chronological age, was first used. The latest 
revision of this test is widely employed today and will be more fully consid- 
ered in Chapter 8. Of special interest, too, is the first Kuhlmann-Binet revi- 
sion, which extended the scale downward to the age level of 3 months (21). 
This scale represents one of the earliest efforts to develop preschool and
infant tests of intelligence.
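
The ratio definition of the IQ mentioned above can be illustrated with a short computation. The sketch below assumes the customary convention of multiplying the ratio of mental age to chronological age by 100; the function name and the use of months as the unit of age are illustrative choices, not taken from the text.

```python
def ratio_iq(mental_age_months: float, chronological_age_months: float) -> float:
    """Ratio IQ: 100 * (mental age / chronological age).

    Both ages must be expressed in the same unit (months here, by assumption).
    """
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_months / chronological_age_months


# Hypothetical child: chronological age 8 years (96 months),
# performing at the level of a typical 10-year-old (120 months).
print(ratio_iq(120, 96))  # 125.0
```

On this definition, a child whose mental age equals his chronological age obtains an IQ of 100, whatever his actual age.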


GROUP TESTING 


The Binet tests, as well as all their revisions, are individual scales in the 
sense that they can be administered to only one person at a time. Many of the 
tests in these scales require oral responses from the subject or necessitate 
the manipulation of materials. Some call for individual timing of responses. For 
these and other reasons, such tests are not adapted to group administration. 
Another characteristic of the Binet type of test is that it requires a highly 
trained examiner. Such tests are essentially clinical instruments, suited to the
intensive study of individual cases. 

Group testing, like the first Binet scale, was developed to meet a pressing 
practical need. When the United States entered World War I in 1917, a com- 
mittee was appointed by the American Psychological Association to consider 
ways in which psychology might aid in the conduct of the war. This committee,
under the direction of Robert M. Yerkes, recognized the need for the
rapid classification of the million and a half recruits with respect to general 
intellectual level. Such information was of assistance in many administrative
decisions, including rejection or discharge from military service, assignment 
to different types of service, or admission to officer training camps. It was in 
this setting that the first group intelligence test was developed. In this task, the 
Army psychologists drew upon all available test materials, and especially 
upon an unpublished group intelligence test prepared by Arthur S. Otis, 
which he turned over to the Army. 

The tests finally developed by the Army psychologists have come to be 
known as the Army Alpha and the Army Beta. The former was designed for 
general routine testing; the latter was a non-language scale employed with 
illiterates and with foreign-born recruits who were unable to take a test in 
English. Both were suitable for administration to large groups. 

Shortly after the termination of World War I, the Army tests were released
for civilian use. Not only did the Army Alpha and Army Beta themselves
pass through many revisions, the latest of which are even now in use, but
they also served as models for most group intelligence tests. The testing 
movement underwent a tremendous spurt of growth. Soon group intelligence 
tests were being devised for all ages and types of persons, from preschool 
children to graduate students. Large-scale testing programs, previously im- 
possible, were now being launched with zestful optimism. Since group tests 
were designed as mass testing instruments, they not only permitted the simul- 
taneous examination of large groups, but they also simplified the instructions 
and administration procedures so as to demand a minimum of training on the 
part of the examiner. School teachers began to give tests to their classes. Col- 
lege students were routinely examined prior to admission. Extensive studies 
of special adult groups, such as prisoners, were undertaken. And soon the 
general public became “IQ-conscious.” 

The application of such group intelligence tests far outran their technical
improvement. That the tests were still crude instruments was often forgotten 
in the rush of gathering scores and drawing practical conclusions therefrom. 
When the tests failed to meet unwarranted expectations, skepticism and hostility
toward all testing often resulted. Thus the testing boom of the twenties,
based upon the indiscriminate use of tests, may have done as much to retard 
as to advance the progress of psychological testing. 


TESTS OF SPECIAL APTITUDES 


Although intelligence tests were originally designed to sample a wide vari- 
ety of functions in order to estimate the individual's "general intellectual 
level," it soon became apparent that such tests were quite limited in their 
coverage. Not all important functions were represented. In fact, most intelligence
tests were primarily measures of verbal ability and, to a lesser extent,
of the ability to handle numerical and other abstract and symbolic relations. 
Gradually psychologists came to recognize that the term “intelligence test” 
was a misnomer, since only certain aspects of intelligence were measured by 
such tests. 

To be sure, the tests covered abilities that are of prime importance in our 
culture. But it was realized that more precise designations, in terms of the 
type of information these tests are able to yield, would be preferable. For 
example, a number of tests that would probably have been known as “in- 
telligence tests” during the twenties were now described as “scholastic apti- 
tude tests." Such a shift in terminology was made in recognition of the fact 
that many so-called intelligence tests measure that combination of abilities 
demanded by academic work. 

Some intelligence tests have been given the name of "general classification 
tests" or "screening tests." An example is the Army General Classification 
Test (AGCT) developed during World War II to serve the same general
purposes as the Army Alpha of World War I. Tests for use in industrial per- 
sonnel selection are also frequently designated as general classification tests. 
Such a name itself has grown out of the fact that intelligence tests, and espe- 
cially group intelligence tests, are often used as rough, preliminary screening 
instruments and are then followed by more detailed measures of special 
aptitudes. Among the latter are to be found tests of mechanical, clerical, 
musical, and artistic aptitudes. Standardized tests in all these areas are available
and are widely used today for educational and vocational counseling,
personnel selection, and other purposes.

The need for special aptitude tests to supplement so-called intelligence tests 
is now generally recognized. In this connection, it is interesting to note the re- 
sults of a "poll of experts" conducted in 1944 among a representative group 
of psychologists in the testing field (18). Of the 79 psychologists replying, 55 
believed that “most will be accomplished if psychologists concentrate on 
measuring separate intellectual factors.” At the other extreme, only 5 expressed
the opinion that test development should be oriented primarily toward
the measurement of general intelligence. It should not be inferred from
such replies, of course, that these test experts were dissatisfied with current
intelligence tests or that they advocated the abolition of such tests. On the 
contrary, when asked how well intelligence tests meet the practical needs for 
classifying people as to general ability in the army, in schools, and in indus- 
try, over 75 per cent chose the response, "Rather well, much better than is 
done without tests." The comments following this question, however, again 
indicated that intelligence tests were regarded as approximate or preliminary
screening instruments which ought to be supplemented by tests of special 
aptitudes. 

Mention may also be made of the contrast between the testing programs in 
World War I and World War II. Such a comparison vividly illustrates the 
direction in which psychological testing had progressed during the intervening 
quarter-century. Although in World War I the Army Alpha and Army Beta 
represented the major part of the psychological testing in the armed forces,
in World War II the AGCT and similar general tests constituted a relatively
small portion of the total test construction effort. The chief contribution
of test psychologists during World War II was in the development of special- 
ized test “batteries,” or combinations of tests. Research on such batteries was 
conducted on a vast scale and attained heretofore undreamed-of proportions. 
Special batteries were constructed for pilots, bombardiers, radio operators,
range finders, and scores of other military specialists. A report of the batteries 
prepared in the Air Force alone occupies at least nine of the nineteen vol- 
umes devoted to the aviation psychology program during World War II (2). 
Some available tests of special aptitudes, especially in the mechanical and 
motor areas, were utilized in the military test batteries; but many of the tests 
were specially devised for the purpose. Such military tests also illustrate another
aspect of current test development which will be considered in the following
section.


MULTIPLE APTITUDE BATTERIES 


The critical evaluation of intelligence tests which followed their widespread 
and indiscriminate use during the twenties also revealed another noteworthy 
fact, namely, that an individual’s performance on different parts of such a 
test often showed marked variation. This was especially apparent on group 
tests, in which the items are commonly segregated into subtests of relatively 
homogeneous content. For example, a person might score relatively high on a
verbal subtest and low on a numerical subtest, or vice versa. To some extent,
such internal variability is also discernible on a test like the Stanford-Binet,
in which, for example, all items involving words might prove very difficult for 
a particular individual, while items employing pictures or geometric diagrams 
may place him at an advantage. 

Test users, and especially clinicians, frequently utilized such intercompari- 
sons in order to obtain more insight into the individual's psychological make- 
up. Thus not only the “IQ,” or other total score, but also scores on subtests 
would be examined in the evaluation of the individual case. Such a practice 
is not to be generally recommended, however, since intelligence tests were not
designed for the purpose of differential aptitude analysis. Often the subtests 
being compared contain too few items to yield a stable or reliable estimate of 
a specific ability. As a result, the obtained difference between subtest scores
might be reversed if the individual were retested on a different day or with 
another form of the same test. If such intra-individual comparisons are to be 
made, tests are needed that are specially designed to reveal differences in per- 
formance in various functions. 

While the practical application of tests demonstrated the need for differen- 
tial aptitude tests, a parallel development in the study of trait organization 
was gradually providing the means for constructing such tests. Statistical 
studies on the nature of "intelligence" had been exploring the interrelations 
among scores obtained by many persons on a wide variety of different tests. 
Such investigations were begun by the English psychologist, Charles Spear- 
man, during the first decade of the present century (31, 32). Subsequent 
methodological developments, based upon the work of such American psy- 
chologists as T. L. Kelley (17) and L. L. Thurstone (35, 36), as well as upon 
that of other American and English investigators, have come to be known 
as factor analysis. 

The contributions that the methods of factor analysis have made to test 
construction will be more fully examined and illustrated in Chapter 13. For 
the present, it will suffice to note that the data gathered by such procedures 
have indicated the presence of a number of relatively independent “factors,”
or traits. Some of these traits were represented, in varying proportions, in the 
traditional intelligence tests. Verbal comprehension and numerical reasoning 
are examples of this type of trait. Others, such as spatial, perceptual, and
mechanical aptitudes, had been touched upon only slightly, if at all, in most 
intelligence tests. 

One of the chief practical outcomes of factor analysis was the development 
of multiple aptitude batteries. These batteries are designed to provide a meas- 
ure of the individual's standing in each of a number of traits. In place of a 
total score or “IQ,” a separate score is obtained for such traits as verbal comprehension,
numerical aptitude, spatial visualization, arithmetic reasoning,
and perceptual speed. Such batteries thus provide a suitable instrument for
making the kind of intra-individual analysis, or differential diagnosis, that
clinicians had been trying for many years to obtain from intelligence tests,
with crude and often erroneous results. These batteries also incorporate into
a comprehensive and systematic testing program much of the information
formerly obtained from special aptitude tests, since the multiple aptitude batteries
cover some of the traits not ordinarily included in intelligence tests.

Multiple aptitude batteries represent a relatively late development in the
testing field. Nearly all have appeared since 1945. In this connection, the 
work of the military psychologists during World War II should again be cited. 
Much of the test research conducted in the armed services was based on 
factor analysis and was directed toward the construction of differential apti- 
tude batteries. This was especially true of the previously mentioned work of 
the Air Force psychologists. Research along these lines is still in progress 
under the sponsorship of various branches of the armed services. A number of 
differential aptitude batteries have likewise been developed for civilian use
and are being widely applied in educational and vocational counseling, personnel
selection, and similar areas. The principal examples of such batteries
will be discussed in Chapter 13. 


MEASUREMENT OF PERSONALITY 


A phase of psychological testing that is still in its infancy is represented by 
the various efforts to measure non-intellectual aspects of behavior. Tests de- 
signed for this purpose are commonly known as “personality tests,” although 
some psychologists prefer to use the term “personality” in a broader sense, to 
refer to the entire individual. Intellectual as well as non-intellectual traits 
would thus be included under this heading. In the terminology of psycholog- 
ical testing, however, the designation “personality test" most often refers to 
measures of such characteristics as emotional adjustment, social relations,
motivation, interests, and attitudes. 

An early precursor of personality testing may be recognized in Kraepelin’s 
use of the free association test with abnormal patients. In such a test, the 
subject is given specially selected stimulus words and is required to respond 
to each with the very first word that comes to mind. Kraepelin also employed
this technique to study the psychological effects of fatigue, hunger, and drugs,
and concluded that all these agents increase the relative frequency of super- 
ficial associations (19). Sommer (30), also writing during the last decade of 
the nineteenth century, suggested that the free association test might be used 
to differentiate between the various forms of mental disorder. The free asso- 
ciation technique has subsequently been utilized for a variety of testing pur- 
poses and is still currently employed. Mention should also be made of the 
work of Galton, Pearson, and Cattell in the development of standardized 
questionnaire and rating-scale techniques. Although originally devised for 
other purposes, these procedures were eventually employed by others in con- 
structing some of the most common types of current personality tests. 

The prototype of the personality questionnaire, or self-report inventory, is
the Personal Data Sheet developed by Woodworth during World War I (cf.
33, Ch. 5). This test was designed as a rough screening device for identifying
seriously neurotic men who would be unfit for military service. The inven- 
tory consisted of a number of questions dealing with common neurotic symp- 
toms, which the individual answered about himself. A total score was ob- 
tained, in terms of the number of symptoms reported. Immediately after the 
war, civilian forms of this questionnaire were prepared, including a special 
form for use with children. The Woodworth Personal Data Sheet, moreover, 
served as a model for most subsequent emotional adjustment inventories. In 
some of these questionnaires, an attempt was made to subdivide emotional 
adjustment into more specific forms, such as home adjustment, school adjust- 
ment, and vocational adjustment. Other tests concentrated more intensively 
upon a narrower area of behavior, or were concerned with more distinctly 
social responses, such as ascendance-submission in personal contacts. A later 
development was the construction of tests for quantifying the expression of 
interests and attitudes. These tests, too, were based essentially upon questionnaire
techniques.

Another approach to the measurement of personality is through the appli- 
cation of performance or situational tests. In such tests, the subject has a task 
to perform whose purpose is generally disguised. Most of these tests simulate 
everyday-life situations quite closely. The subject’s reactions in these situations
are observed without his knowledge. The first extensive application of
such techniques is to be found in the tests developed in the late twenties and 
early thirties by Hartshorne and May (14, 15, 16). This series, standardized
on school children, was concerned with such behavior as cheating, lying,
stealing, cooperativeness, and persistence. Objective, quantitative scores could
be obtained on each of a large number of specific tests. A more recent illus- 
tration, for the adult level, is provided by the series of situational tests devel- 
oped during World War II in the Assessment Program of the Office of Strategic 
Services (35). These tests were concerned with relatively complex and subtle 
social and emotional behavior, and required rather elaborate facilities and
trained personnel for their administration. The interpretation of the subject’s
responses, moreover, was relatively subjective.

Projective techniques represent a third approach to the study of personality, 
and one that has shown phenomenal growth, especially among clinicians. In 
such tests, the subject is given a relatively “unstructured” task which permits 
wide latitude in its solution. The assumption underlying such methods is that 
the individual will "project" his characteristic modes of response into such a 
task. Like the performance and situational tests, projective techniques are
more or less disguised in their purpose, thereby reducing the chances that
the subject can deliberately create a desired impression. The previously cited
free association test represents one of the earliest types of projective tech- 
nique. A certain form of sentence-completion test has likewise been used in a 
similar manner. Other tasks commonly employed in projective techniques in- 
clude drawing, arranging toys to create a scene, extemporaneous dramatic 
play, ranking photographs in order of preference, and interpreting a series of 
pictures or inkblots. 

All the available types of personality tests present serious difficulties, both 
practical and theoretical. Each approach has its own special advantages and 
weaknesses. The specific problems encountered in personality test construc- 
tion will be considered in later chapters. For the present, it will suffice to 
point out that personality testing lags far behind aptitude testing in its posi- 
tive accomplishments. Nor is such lack of progress to be attributed to insuffi- 
cient effort. Research on the measurement of personality has reached vast 
proportions during the past decade, and many ingenious devices and technical
improvements are under investigation. It is rather the special difficulties
encountered in the measurement of personality that account for the slow advances
in this area.

The general public still identifies psychological tests primarily with intelligence
tests. The rapid growth and widespread application of group intelligence
tests following World War I have left their mark upon the popular concept
of what constitutes a psychological test. Moreover, such tests are often
loosely described as “IQ tests." Such a designation undoubtedly reflects the 
popular appeal of age norms and of the intelligence quotient as a technique
for reporting the individual's intellectual status. The term “IQ test” is, however,
misleading. It is to be hoped that its use will gradually disappear as the
public learns more about psychological tests. The IQ refers not to a type of
test but to a particular way of interpreting scores on certain psychological
tests. Moreover, the IQ is applicable to relatively few psychological tests, as
will be seen in Chapter 4. Other, more precise and more widely applicable
scoring procedures have been developed and are being employed increasingly
in present-day tests. 

Intelligence tests themselves represent only one of several types of cur- 
rently available psychological tests. As can be seen from the historical intro- 
duction in Chapter 1, many other kinds of psychological tests have been de- 
vised. In terms of both technical excellence and practical value, these other 
types of tests are often superior to the general intelligence tests. It thus seems 
appropriate, before proceeding further, to inquire into the exact nature of a
psychological test. Psychological tests are more varied and broader in scope
than might at first appear. What, then, constitutes a psychological test? What
are its essential characteristics? 


WHAT IS A PSYCHOLOGICAL TEST? 


A psychological test is essentially an objective and standardized measure
of a sample of behavior. Psychological tests are like tests in any other science,
in so far as observations are made upon a small but carefully chosen
sample of an individual's behavior. In this respect, the psychologist proceeds 
in much the same way as the chemist who tests a shipment of iron ore or a 
supply of water by analyzing one or more samples of it. If the psychologist 
wishes to test the extent of a child's vocabulary, or a clerk's ability to perform 
arithmetic computations, or a pilot's eye-hand coordination, he examines
their performance with a representative set of words, or arithmetic problems, 
or motor tests. Whether or not the test adequately covers the behavior under 
consideration obviously depends upon the number and nature of items in the 
sample. For example, an arithmetic test consisting of only five problems, or 
one including only multiplication items, would be a poor measure of the indi- 
vidual’s computational skill. A vocabulary test composed entirely of baseball 
terms would hardly provide a dependable estimate of a child’s total range of 
vocabulary. 

The diagnostic or predictive value of a psychological test depends upon 
the degree to which it serves as an indicator of a relatively broad and signifi- 
cant area of behavior. Measurement of the behavior sample directly covered 
by the test is rarely, if ever, the goal of psychological testing. The child’s 
knowledge of a particular list of 50 words is not, in itself, of great interest. 
Nor is the job applicant's performance on a specific set of 20 arithmetic 
problems of much importance. If, however, it can be demonstrated that 
there is a close correspondence between the child's knowledge of the word 
list and his total mastery of vocabulary, or between the applicant's score on 
the arithmetic problems and his computational performance on the job, then 
the tests are serving their purpose. 

It should be noted in this connection that the test items need not resemble
closely the behavior the test is to predict. It is only necessary that an empirical
correspondence be demonstrated between the two. The degree of similarity
between the test sample and the predicted behavior may vary widely. At one
extreme, the test may coincide completely with a part of the behavior to be
predicted. An example might be a foreign vocabulary test in which the students
are examined on 20 of the 50 new words they have studied; another
example is provided by the road test taken prior to obtaining a driver's license.
A lesser degree of similarity is illustrated by many vocational aptitude
tests administered prior to job training, [...]
their superficial differences, all these tests consist of samples of the individual’s
behavior. And each must prove its worth by an empirically demonstrated
correspondence between the subject’s performance on the test and in other 
situations. 

Whether the term “diagnosis” or “prediction” is employed in this connec- 
tion also indicates a minor distinction. Prediction commonly connotes a tem- 
poral estimate, the individual's future performance on a job, for example, 
being forecast from his present test performance. In a broader sense, how- 
ever, even the diagnosis of present condition, such as feeblemindedness or 
emotional disorder, implies a prediction of what the individual will do in situ- 
ations other than the present test. It is logically simpler to consider all tests as 
behavior samples from which predictions regarding other behavior can be 
made. Different types of tests can then be characterized as variants of this 
basic pattern. 

Another point that should be considered at the outset pertains to the con- 
cept of capacity. It is entirely possible, for example, to devise a test for pre- 
dicting how well an individual can learn French before he has even begun the 
study of French. Such a test would involve a sample of the types of behavior 
required to learn the new language, but would in itself presuppose no knowl- 
edge of French. It could then be said that this test measures the individual’s 
“capacity” or “potentiality” for learning French. Such terms should, however, 
be used with caution in reference to psychological tests. Only in the sense 
that a present behavior sample can be used as an indicator of other, future 
behavior can we speak of a test measuring "capacity." No psychological 
test can do more than measure behavior. Whether such behavior can serve 
as an effective index of other behavior can be determined only by empirical 
try-out. 

Standardization. It will be recalled that in the initial definition a psycho- 
logical test was described as a standardized measure. Standardization implies 
uniformity of procedure in administering and scoring the test. If the scores 
obtained by different individuals are to be comparable, testing conditions 
must obviously be the same for all. Such a requirement is only a special 
application of the need for controlled conditions in all scientific observations. 
In a test situation, the single independent variable is usually the individual 
being tested. 

In order to secure uniformity of testing conditions, the test constructor 
provides detailed directions for administering each newly developed test. The
formulation of such directions is a major part of the standardization of a new
test. Such standardization extends to the exact materials employed, time
limits, oral instructions to subjects, preliminary demonstrations, ways of handling
queries from subjects, and every other detail of the testing situation.
Many other, more subtle factors may influence the subject's performance on
certain tests. Thus in giving instructions or presenting problems orally, consideration
must be given to the rate of speaking, tone of voice, inflection,
pauses, and facial expression. In a test involving the detection of absurdities,
for example, the correct answer may be given away by smiling or pausing 
when the crucial word is read. 

In so far as possible, the surroundings should also be standardized. Cer- 
tainly adequate lighting, proper ventilation, and freedom from discomfort 
and distraction should be common requirements in all testing situations. At- 
tention should also be given to motivating the subject, arousing his interest, 
eliciting his cooperation, and establishing "rapport." The question of rapport 
will be considered more fully in Chapter 3. In the present connection, however,
it should be noted that in this regard, as in other aspects of testing,
conditions must be standardized as much as possible for all subjects. 

Norms. Another important step in the standardization of a test is the estab- 
lishment of norms. Without norms, test scores cannot be interpreted. Psycho- 
logical tests have no predetermined standards of "passing" or "failing." An 
individual's score can be evaluated only by comparing it with the scores obtained
by others. As its name implies, a norm is the "normal" or average
performance. Thus if normal 8-year-old children complete 12 out of 50 prob- 
lems correctly on a particular arithmetic reasoning test, then the 8-year-old 
norm on this test corresponds to a score of 12. The latter is known as the 
“raw score” on the test. It may be expressed as number of correct items, time 
required to complete a task, number of errors, or some other objective 
measure appropriate to the content of the test. Such a raw score is meaning- 
less until evaluated in terms of a suitable set of norms. 

In the process of standardizing a test, it must be administered to a large,
representative sample of the type of subjects for whom it is designed. This 
group, known as the standardization sample, serves to establish the norms. 
Such norms indicate not only the average performance but also the relative 
frequency of varying degrees of deviation above and below the average. It is
thus possible to evaluate different degrees of superiority and inferiority. The
specific ways in which such norms may be expressed will be considered in
Chapter 4. All permit the designation of the individual's position with reference
to the normative or standardization sample.
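
The use of a standardization sample can be illustrated with a minimal sketch. The scores below are hypothetical and are chosen only so that the norm comes out at 12, as in the arithmetic reasoning example given earlier; the function name `norm_and_standing` is likewise an assumption made for illustration, not a procedure prescribed by any particular test.

```python
from statistics import mean


def norm_and_standing(sample_scores, raw_score):
    """Evaluate a raw score against a standardization sample.

    Returns the norm (average of the sample), the deviation of the
    raw score from that norm, and the proportion of the sample
    scoring below the raw score.
    """
    norm = mean(sample_scores)
    below = sum(1 for s in sample_scores if s < raw_score)
    return norm, raw_score - norm, below / len(sample_scores)


# Hypothetical raw scores of nine 8-year-olds (illustrative only).
sample = [8, 10, 11, 12, 12, 12, 13, 14, 16]

# A child with a raw score of 14 is compared with the sample.
norm, deviation, prop_below = norm_and_standing(sample, 14)
```

The raw score of 14 acquires meaning only through the comparison: it lies above the norm, and the proportion of the sample falling below it indicates how far above.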

[...] tests, the norm corresponds to the performance of typical or average individuals.
On ascendance-submission tests, for example, the norm falls at an
intermediate point representing the degree of ascendance or submission mani- 
fested by the average individual. Similarly, in an emotional adjustment in- 
ventory, the norm does not ordinarily correspond to a complete absence of 
unfavorable or maladaptive responses, since a few such responses occur in the 
majority of "normal" individuals in the standardization sample. It is thus ap- 
parent that psychological tests, of whatever type, are based upon empirically 
established norms. 

Objective Measurement of Difficulty. Reference to the definition of a psy- 
chological test with which this discussion opened will show that such a test 
was characterized as an objective as well as a standardized measure. In what 
specific ways are such tests objective? Some aspects of the objectivity of psy- 
chological tests have already been touched upon in the discussion of stand- 
ardization. Thus the administration, scoring, and interpretation of scores are 
objective in so far as they are independent of the subjective judgment of the 
individual examiner. Any one individual should theoretically obtain the 
identical score on a test regardless of who happens to be his examiner. This 
is not entirely so, of course, since perfect standardization and objectivity have 
not been attained in practice. But at least such objectivity is the goal of test 
construction and has been achieved to a reasonably high degree in most 
tests. 

There are other major ways in which psychological tests can be properly 
described as objective. The determination of the difficulty level of an item 
or of a whole test, and the measurement of test reliability and validity, are 
based upon objective, empirical procedures. The concepts of reliability and 
validity will be considered in subsequent sections. We shall turn our attention 
first to the concept of difficulty. 

When Binet and Simon prepared their original, 1905 Scale for the meas- 
urement of intelligence (cf. Ch. 1), they arranged the 30 items of the scale 
in order of increasing difficulty. Such difficulty, it will be recalled, was deter- 
mined by trying out the items on 50 normal and a few retarded and feeble- 
minded children. The items correctly solved by the largest proportion of sub- 
jects were, ipso facto, taken to be the easiest; those passed by relatively few 
subjects were regarded as more difficult items. By such a procedure, an 
empirical order of difficulty was established. This early example typifies the 
objective measurement of difficulty level, which is now common practice in
psychological test construction. 
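
The procedure just described can be sketched in a few lines. The try-out data are hypothetical, and the function names `item_difficulty` and `order_by_difficulty` are assumptions introduced for illustration; the sketch shows only the general principle of ranking items by the proportion of subjects passing them, not the actual computations of Binet and Simon.

```python
def item_difficulty(responses):
    """Return the proportion of subjects passing each item.

    `responses` holds one list per subject; each inner list contains
    1 (pass) or 0 (fail) for every item, in the same item order.
    """
    n_subjects = len(responses)
    n_items = len(responses[0])
    return [sum(subject[i] for subject in responses) / n_subjects
            for i in range(n_items)]


def order_by_difficulty(responses):
    """Return item indices arranged from easiest to most difficult."""
    p = item_difficulty(responses)
    # A higher proportion passing means an easier item.
    return sorted(range(len(p)), key=lambda i: -p[i])


# Hypothetical try-out: 5 subjects, 4 items (illustrative only).
tryout = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]
proportions = item_difficulty(tryout)
ordering = order_by_difficulty(tryout)
```

Items passed by equal proportions of subjects keep their original relative order, since no empirical basis exists for ranking one above the other.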

Not only the arrangement but also the selection of items for inclusion in a 
test can be determined by the proportion of subjects in the trial samples who
pass each item. Thus if there is a bunching of items at the easy or difficult
end of the scale, some items can be discarded. Similarly, if items are sparse in
certain portions of the difficulty range, new items can be added to fill in the
gaps.

Frequency of correct response is also employed in constructing age scales, 
such as the later revisions of the Binet scales. In such a case, the proportion 
of children at each age level who pass each item is determined. The item is
then assigned to that age level at which a certain proportion pass it. 
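
A minimal sketch of such an age placement follows. The text does not say what the "certain proportion" is, nor how ties between ages are resolved, so both are assumptions here: the 0.75 criterion is purely illustrative, and the item is placed at the youngest age whose pass rate reaches it. The pass rates are invented.

```python
def assign_age_level(pass_rates, criterion=0.75):
    """Return the youngest age at which the pass rate reaches `criterion`.

    `pass_rates` maps age (in years) to the proportion of children of
    that age who passed the item. Returns None if no age reaches the
    criterion. The 0.75 default is an illustrative assumption, not a
    value taken from the text.
    """
    for age in sorted(pass_rates):
        if pass_rates[age] >= criterion:
            return age
    return None


# Hypothetical pass rates for one item at four age levels.
item_pass_rates = {6: 0.30, 7: 0.55, 8: 0.78, 9: 0.92}
placed_at = assign_age_level(item_pass_rates)  # 8
```

Changing the criterion moves the item up or down the scale, which is why the proportion chosen has to be reported along with the scale.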

The difficulty level of the test as a whole is, of course, directly dependent 
upon the difficulty of the items that make up the test. A comprehensive check 
of the difficulty of the total test for the population for which it is designed is 
provided by the distribution of total scores. If the standardization sample is a 
representative cross section of such a population, then it is generally expected 
that the scores will fall roughly into a normal distribution curve. In other 
words, there should be a clustering of individuals near the center of the 
range, and a gradual tapering off as the extremes are approached. A theoretical
normal curve, with all irregularities eliminated, is shown in Figure 1.

Fig. 1. A Normal Distribution Curve.

In plotting such a frequency distribution, scores are indicated on the baseline,
and frequencies, or number of persons obtaining each score, on the vertical
axis. A smooth curve like the one illustrated above is closely approximated
when very large samples are tested.

Fig. 2. Skewed Distribution Curves. A. Piling at Lower End of the Scale. B. Piling at Upper End of the Scale.

Let us suppose, however, that the obtained distribution curve is not "normal"
but clearly skewed, as illustrated in Figures 2A and 2B. The former
distribution, with a piling of scores at the low end, suggests that the test has
insufficient "floor" for the group under consideration, lacking a sufficient
number of easy items to discriminate properly at the lower end of the range. The
result is that persons who would normally scatter over a considerable range
obtain zero or near-zero scores on this test. A peak at the low end of the
scale is therefore obtained. Such an artificial piling of scores is illustrated
schematically in Figure 3, in which a normally distributed group yields a
skewed distribution on a particular test. The opposite skewness is illustrated 
in Figure 2B, with the scores piled up at the upper end, a finding which sug- 
gests insufficient test "ceiling." Administering a test designed for the general 
population to selected samples of college or graduate students will usually 
yield such a skewed distribution, a number of students obtaining nearly perfect 
scores. With such a test, it is impossible to measure individual differences 
among the more able subjects in the group. If more difficult items had been 
included in the test, some individuals would undoubtedly have scored higher 
than the present test permits. 
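
The two kinds of piling can be checked numerically as well as by inspecting the curve. The sketch below is an illustration only: it uses the ordinary moment coefficient of skewness, which the text does not name, and the score lists, the 10-item test length, and the one-point `margin` are all invented. A positive coefficient goes with piling at the low end (too little floor), a negative one with piling at the high end (too little ceiling).

```python
def skewness(scores):
    """Moment coefficient of skewness (population form)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    if m2 == 0:
        raise ValueError("all scores are identical; skewness is undefined")
    return m3 / m2 ** 1.5


def edge_proportions(scores, max_score, margin=1):
    """Share of scores within `margin` points of zero and of `max_score`."""
    n = len(scores)
    at_floor = sum(1 for x in scores if x <= margin) / n
    at_ceiling = sum(1 for x in scores if x >= max_score - margin) / n
    return at_floor, at_ceiling


# Hypothetical scores on a 10-item test.
too_hard = [0, 0, 0, 1, 1, 2, 3, 5, 8]     # piling at the low end
too_easy = [10 - x for x in too_hard]      # mirror image: piling at the top

hard_skew = skewness(too_hard)             # positive
easy_skew = skewness(too_easy)             # negative
hard_edges = edge_proportions(too_hard, 10)
easy_edges = edge_proportions(too_easy, 10)
```

In the first group more than half the subjects fall within one point of zero, and in the second the same share falls within one point of the maximum. In both cases the test cannot separate those subjects from one another.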

Fig. 3. Skewness Resulting from Insufficient "Test Floor."

When the standardization sample yields a markedly non-normal distribution
on a test, the difficulty level of the test is ordinarily modified until a
normal curve is approximated. Depending upon the type of deviation from
normality that appears, easier or more difficult items may be added, other
items eliminated or modified, the position of items in the scale altered, or the
scoring "weights" assigned to certain responses revised. Such adjustments are
continued until the distribution becomes at least roughly normal. Under these
conditions, the most likely score, obtained by the largest number of subjects,
usually corresponds to about 50 per cent correct items. To the layman who is
unfamiliar with the methods of psychological test construction, a "50 per 
cent" score may seem shockingly low. It is sometimes objected, on this basis, 
that the examiner has set too low a "standard of passing" on the test. Or the 
inference is drawn that the group tested is a particularly poor one. Both con- 
clusions, of course, are totally meaningless when viewed in the light of the 
procedures followed in developing psychological tests. Such tests are delib- 
erately constructed and specifically modified so as to yield a mean score of 
approximately 50 per cent correct. Only in such a way can the maximum 
differentiation between individuals at all ability levels be obtained with the 
test. With a mean of approximately 50 per cent correct items, there is the 
maximum opportunity for a normal distribution, with individual scores
spreading widely at both extremes.* 
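
One way to see why a level near 50 per cent favors differentiation is to count, for a single item, how many pairs of persons it tells apart, that is, pairs in which one person passes and the other fails. This counting argument is a supplementary illustration and is not taken from the text, which speaks of the total score rather than of single items. The group size of 100 is arbitrary.

```python
def pairs_discriminated(n_persons, n_passing):
    """Number of person pairs an item separates (one passes, one fails)."""
    return n_passing * (n_persons - n_passing)


n = 100
# Pairs separated when 0, 10, 20, ... 100 of the 100 persons pass.
counts = {k: pairs_discriminated(n, k) for k in range(0, n + 1, 10)}
# Number of passers that separates the most pairs.
best = max(range(n + 1), key=lambda k: pairs_discriminated(n, k))
```

An item that everyone passes or everyone fails separates no pairs at all, while an item passed by half the group separates 2,500 of them, the largest number possible for 100 persons.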

Reliability. "How good is this test?" "Does it really work?" These questions
could, and occasionally do, lead to long hours of futile discussion.
Subjective opinions, hunches, and personal biases may lead, on the one hand,
to extravagant claims regarding what a particular test can accomplish and, on
the other hand, to stubborn rejection. The only way in which questions such
as these can be conclusively answered is by empirical trial. The objective
evaluation of psychological tests involves primarily the determination of the
reliability and the validity of the test in specified situations.

As used in psychometrics, the term "reliability" always means stability or
consistency. Test reliability is the consistency of scores obtained by the same
persons when retested with the identical test or with an equivalent form of
the test. If a child receives an IQ of 110 on Monday and an IQ of 80 when
retested on Friday, it is obvious that little or no confidence can be put in
either score. Similarly, if one set of 50 words enables an individual to identify
40 correctly, while on another, supposedly equivalent set, he can get a score
of only 20 right, then neither score can be taken as a dependable index of
his verbal comprehension. To be sure, in both illustrations it is possible that
only one of the two scores is in error, but this could be demonstrated only
by further retests. From the given data, we can conclude only that both
scores cannot be right. Whether one or neither is an adequate estimate of the
individual's true ability we cannot determine without additional information.

* The normal curve, however, has an advantage if subsequent statistical analyses of scores are to
be conducted, since many current statistical techniques assume approximate normality of
distribution. For this and other reasons, it is likely that most tests designed for general use will
continue to follow a normal-curve pattern for some time to come. In the construction of
custom-made tests to serve clearly defined purposes, however, the form of the distribution of scores
should depend upon the type of discrimination desired (cf. 11, 15).

Before a psychological test is released for general use, a thorough,
objective check of its reliability must be carried out. The different types of test
reliability, as well as methods of measuring each, will be considered in a later
chapter. Reliability can be checked with reference to temporal fluctuations, the
particular selection of items or behavior sample constituting the test, the role
of different examiners or scorers, and other aspects of the testing situation. It
is essential to specify the type of reliability and method employed to
determine it, since the same test may vary in these different aspects. The number
and nature of individuals on whom reliability was checked should likewise
be reported. With such information, the test user can predict whether the test
will be about equally reliable for the group with which he expects to use it,
or whether it is likely to be more reliable or less reliable.
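
As a rough illustration of reliability in the sense of consistency over time, the sketch below correlates the scores of the same persons on two administrations. It assumes that consistency is summarized by a Pearson product-moment correlation; the text leaves the specific methods to a later chapter, so this is one possible index and not necessarily the one the author has in mind. All scores are invented.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores of five children on a first testing and a retest.
first_testing = [95, 100, 105, 110, 120]
retest_consistent = [97, 99, 106, 111, 118]
retest_inconsistent = [110, 95, 120, 80, 100]

r_consistent = pearson_r(first_testing, retest_consistent)
r_inconsistent = pearson_r(first_testing, retest_inconsistent)
```

In the first retest each child keeps nearly the same standing and the coefficient is close to 1. In the second, standings are reshuffled, as with the child scoring 110 and then 80, and the coefficient falls below zero.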

Validity. Undoubtedly the most important question that needs to be
raised regarding any psychological test concerns its validity, i.e., the degree
to which the test actually measures what it purports to measure. Validity
provides a direct check on how well the test fulfills its function. The
determination of validity usually requires independent, external criteria of whatever
the test is designed to measure. For example, if a medical aptitude test is to
be used in selecting promising applicants for medical school, ultimate success
in medical school would be a criterion. In the process of validating such a
test, it would be administered to a large group of students at the time of their
admission to medical school. Some measure of performance in medical school
would eventually be obtained for each student, on the basis of grades, ratings
by instructors, success or failure in completing training, and the like. Such a
composite measure constitutes the criterion with which each student's initial
test score is to be correlated. A high correlation, or validity coefficient, would
signify that those individuals who scored high on the test had been relatively
successful in medical school, while those scoring low on the test had done
poorly in medical school. A low correlation would indicate little
correspondence between test score and criterion measure, and hence poor validity for
the test. The validity coefficient enables us to determine how closely the
criterion performance could have been predicted from the test scores.
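
The link between the validity coefficient and closeness of prediction can be made concrete with a small sketch. It assumes a Pearson correlation, a least-squares line for predicting the criterion from the test score, and the usual standard error of estimate; none of these formulas is spelled out in this passage, and the admission scores and grade averages are invented.

```python
import math


def validate(test_scores, criterion):
    """Return (r, slope, intercept, standard error of estimate).

    r is the validity coefficient; slope and intercept define the
    least-squares line predicting the criterion from the test score;
    the standard error of estimate is sd(criterion) * sqrt(1 - r^2).
    """
    n = len(test_scores)
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion) / n
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    syy = sum((y - mean_y) ** 2 for y in criterion)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, criterion))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # max() guards against a tiny negative value from rounding.
    see = math.sqrt(syy / n * max(0.0, 1.0 - r * r))
    return r, slope, intercept, see


# Hypothetical aptitude scores at admission and later grade averages.
admission_scores = [52, 60, 65, 70, 78, 85]
grade_averages = [2.1, 2.4, 2.9, 2.8, 3.4, 3.6]

r, slope, intercept, see = validate(admission_scores, grade_averages)
predicted_for_75 = intercept + slope * 75
```

The higher the coefficient, the smaller the standard error of estimate, and so the narrower the band within which a student's later standing can be expected to fall around the predicted value.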

In a similar manner, tests designed for other purposes can be validated 
against appropriate criteria. A vocational aptitude test, for example, can be 
validated against on-the-job success of a trial group of new employees. A
pilot aptitude battery can be validated against achievement in flight training
and eventually against combat performance, if the latter records become
available. Tests designed for broader and more varied uses are validated
against a number of criteria. Thus intelligence tests have been validated 
against such criteria as school achievement, ratings by teachers or supervisors, 
or scores on other, previously validated tests. If, as in the case of intelligence 
tests, it is reasonable to expect scores to increase with age up to a certain 
level, then an examination of the mean scores obtained by successive age 
groups provides another check on validity. Similarly, the test performance 
of institutionalized mental defectives may be compared with that of normal 
school children. In tests of emotional instability, the responses of persons 
known to be neurotic or psychotic may be checked against those of unse- 
lected normal adults. 

The reader may have noticed an apparent paradox in the concept of test 
validity. If it is necessary to follow up the subjects or in other ways to obtain 
an independent measure of what the test is trying to predict, why not dispense 
with the test? The answer to this riddle is to be found in the distinction be- 
tween the validation group on the one hand, and the groups on which the 
test will eventually be employed for predictive purposes on the other. Before 
the test is ready for use, its validity must be established on a representative 
sample of subjects. The scores of these persons are not themselves employed 
for predictive purposes, but serve only in the process of "testing the test." If 
the test proves valid by this method, then it can be used on other samples, 
in the absence of criterion measures. 

It might still be argued that we would only need to wait for the criterion 
measure to "mature," or become available, on any group, in order to obtain
the information that the test is trying to predict. But such a procedure 
would be so wasteful of time and energy as to be prohibitive in most in- 
stances. Thus we could determine which applicants will succeed on a job or 
which students will satisfactorily complete college by admitting all who apply 
and waiting for subsequent developments! It is the very wastefulness of such 
a procedure that tests are designed to reduce. By means of tests, the
individual's eventual performance in such situations can be predicted with a
determinable margin of error. The more valid the test, of course, the smaller
will be this margin of error. 

An essential precaution in determining validity is to make certain that the
test scores do not themselves influence the criterion measures. If, for
example, a teacher or a foreman in an industrial
plant knows that a particular individual scored very poorly on an aptitude
test, such knowledge might influence the grade given to the student or the
rating assigned to the worker. Or a high-scoring individual might be given the
benefit of the doubt when academic grades or on-the-job ratings are being
prepared. Such influences would obviously raise the correlation between test
scores and criterion in a manner that is entirely spurious or artificial. This
possible source of error in test validation is known as criterion contamination, 
since the criterion ratings become "contaminated" by the rater's knowledge of 
the test scores. To prevent the operation of such an error, it is absolutely es- 
sential that no person who participates in the assignment of criterion ratings 
have any knowledge of the subjects' test scores. For this reason, test scores 
employed in "testing the test" must be kept strictly confidential. It is some- 
times difficult to convince teachers, employers, military officers, and other 
line personnel that such a precaution is essential. In their urgency to utilize 
all available information for practical decisions, such persons may fail to 
realize that the test scores must be put aside until the criterion data mature 
and validity can be checked. 

The special problems encountered in determining the validity of different 
types of tests, as well as the specific criteria and statistical procedures em- 
ployed, will be discussed in Chapters 6 and 7. One further point, however, 
should be considered at this time. Validity tells us more than the degree to
which the test is fulfilling its function. It actually tells us what the test is
measuring. By examining the criterion data, together with the validity
coefficients of the test, we can objectively determine what the test is measuring.³
It is for this reason that some psychologists prefer to define validity as the
extent to which we know what the test measures. The interpretation of test
scores would undoubtedly be clearer and less ambiguous if tests were
regularly named in terms of the criteria against which they had been validated
(cf. 2). A tendency in this direction can be recognized in such test labels as
"scholastic aptitude test" (6) and "personnel classification test" (23) in place
of the vague title "intelligence test."


VARIETIES OF PSYCHOLOGICAL TESTS 


Major Types of Psychological Tests. It is customary to classify psychological
tests with reference to the aspects of behavior which they sample.
Such a classification is somewhat arbitrary and fluid, as will shortly become
apparent. For practical convenience, however, there are certain advantages
in grouping tests in this manner. Moreover, the terms designating these various
test categories are widely used in the psychological literature. Consequently,
familiarity with such terms is helpful in itself. Outstanding examples
of current tests in each category will be examined in Parts 2, 3, and 4. In
general, the organization of tests in the chapters included under Parts 2, 3,
and 4 will follow the classification given below.

³ Several specific procedures for determining what a test measures will be
discussed in Chapter 5. In most of these, however, the basic step is the
correlation of test scores with some independently obtained criterion data.

As indicated in Chapter 1, the development of general intelligence tests, 
to estimate over-all level of intellectual functioning, was one of the earliest 
goals of psychological testing. The early efforts to identify and classify mental 
defectives, the original Binet-Simon tests, and even the research of Galton 
and Cattell with sensorimotor tests were oriented toward the measurement of 
general intelligence. This type of test, with its many varieties and levels, 
still constitutes one of the largest groups of available psychological tests. The 
most widely used tests in this category, ranging from the infant level to
graduate school, will be discussed in Part 2 (Chs. 8 to 12). 

Multiple aptitude batteries are rapidly replacing general intelligence tests 
for a number of purposes, especially in the counseling of adolescents and 
adults. Although they cover a wide sampling of psychological functions, even 
broader than that included in the traditional intelligence tests, these batteries 
do not ordinarily provide a single over-all score, such as an IQ. Rather, it is 
the principal aim of such batteries to permit differentiation among the indi- 
vidual’s special assets and liabilities—among the high and low spots in his 
"intellectual profile." The construction of multiple aptitude batteries also 
utilizes the latest developments in factor analysis, since the tests that make 
up such batteries are ordinarily chosen to represent the principal traits identi- 
fied by factor analysis. A more comprehensive coverage of abilities, with a 
minimum of needless overlap of test content, is thereby assured. The major
multiple aptitude batteries in current use, together with the methods
employed in their construction, will be considered in Chapter 13.

Tests of special aptitudes will be illustrated in Chapters 14 and 15.
Historically, this type of test antedates the multiple aptitude batteries, having
been first developed to fill in some of the obvious gaps left by the
intelligence tests. Attention was first focused on highly specialized areas that
intelligence tests made no effort to cover, such as musical, artistic, and
mechanical aptitudes. Interest in vocational selection also stimulated the
development of certain special aptitude tests, such as those for the prediction of
clerical aptitude. To some extent, special aptitude tests overlap the functions
covered by multiple aptitude batteries, although some areas not included in
the latter can be measured by certain special aptitude tests. Today, special
aptitude tests are available for a wide variety of purposes. They range from
very specific and simple measures of sensory acuity or speed of finger
movement to complex tests of art appreciation or of aptitude for

A traditional distinction is that between aptitude tests and achievement
tests. The latter are designed to assess the effects of a specified course of 
training. The principal examples of achievement tests are to be found in 
educational testing. In fact, any standardized examination on a school course 
represents an achievement test. Such tests are now extremely common, 
ranging from the elementary school to the graduate and professional schools, 
and covering practically all subjects of instruction. Trade tests, for use in 
screening and selecting industrial and business employees, constitute another 
type of achievement test. As in educational achievement tests, such trade 
tests assume that the testees have had a specific course of instruction, job ap- 
prenticeship, or other relatively uniform experience. Achievement tests for 
use in both education and industry will be covered in Chapters 16 and 17. 
Following the more detailed discussion in those chapters, it will be seen 
that the distinction between achievement and aptitude tests is relative rather 
than absolute, since each type of test can be used, under certain conditions, 
to appraise the effects of past experience and to predict future accomplish- 
ment. 

The last major category of psychological tests, to be considered in Chapters 
18 to 21, is a very broad one and concerns the measurement of personality 
characteristics. The types of tests conventionally placed under this heading 
include measures of emotional adjustment, also known as tests of neuroticism
or emotional instability, or simply as "personality inventories." The category
also covers measures of social traits, involving primarily relations with other
persons, such as ascendance-submission, introversion-extroversion, and
self-sufficiency. Social intelligence, covering the knowledge and skills demanded
in social situations, is sometimes classified with special aptitudes and
sometimes with personality. Tests of character traits, such as honesty,
perseverance, and cooperativeness, are traditionally included under personality
tests, as are measures of motivation, interests, and attitudes. The questionnaire,
projective, and situational tests briefly described in Chapter 1 would all, of
course, fall under the heading of personality tests. Each of these three
techniques has been applied to the measurement of several of the areas listed
above.

Like many of the distinctions made in classifying tests, the dichotomy
between ability and personality tests is to some extent artificial and debatable.
In taking any test, the individual is undoubtedly influenced both by ability
factors and by emotional, motivational, interest, and other non-intellectual
characteristics. Some tests, in fact, have been employed to measure either
ability or personality factors, when administered or scored in different ways.
In the construction of psychological tests, however, emphasis is generally
placed upon one or another aspect of behavior. Such emphasis is reflected in
the instructions to the subjects, the techniques for establishing rapport, scor- 
ing procedures, and other features of the test. 

A useful distinction has been proposed in this connection by Cronbach (8,
pp. 29-34), who differentiates between tests of maximum performance and
tests of habitual performance. The former correspond to ability tests (intelli- 
gence, special aptitudes, etc.), the latter to personality tests. It is certainly 
true that the object of ability tests is characteristically to discover how well 
the individual can perform in a certain area, under the most favorable condi- 
tions that can be provided. Every effort is made in such tests to control moti- 
vation, interest, surrounding conditions, and other contributing factors in such 
a way as to insure that the individual is "doing his best" on the test. On the 
other hand, in personality tests it is the usual or habitual reaction of the in- 
dividual that is sought—not what the individual believes is the best solution 
or what he would like to do, but what he actually does in the specified situa- 
tion. 

Other Bases for Classifying Psychological Tests. Psychological tests are
commonly classified in a number of other ways, some of which may cut
across the major divisions outlined in the preceding section. For example, it
is useful to distinguish between individual and group tests. This dichotomy is
regularly employed with intelligence tests, as indicated in Chapter 1. But it is
equally applicable to any other type of test. For practical purposes, it is of
course very important to know whether a test must be administered singly to
each subject, like the Stanford-Binet, or whether it can be given
simultaneously to large groups. Many individual tests also require a highly trained
examiner and are designed primarily for the study of single cases.

Individual tests enable the examiner to make valuable observations
regarding the subject's work methods and other qualitative aspects of
performance, his social and emotional reactions, and the like. It has been
said, for example, that the Stanford-Binet is in effect a standardized
interview, the experienced clinician obtaining much more information from it
than just the IQ. Individual tests also give the examiner a better opportunity
to establish rapport, obtain cooperation, and maintain the interest of the
subject in the test. Any special conditions that may handicap the subject in his
performance are also more readily noted and remedied in an individual
testing situation. Group tests, on the other hand, not only permit the "mass
testing" characteristic of many contemporary testing programs, but they also
insure more uniformity of procedure, since the role of the examiner is reduced
and simplified, and scoring can be made highly automatic.

Another basis for classifying psychological tests is to be found in the testing
medium. The most familiar differentiation in this connection is that between 
paper-and-pencil and performance tests. This distinction, too, has been ap- 
plied primarily to intelligence tests, although it has more recently proved 
useful with other types, especially personality tests. The familiar paper-and- 
pencil type of test provides each subject with a test form on which all items 
are printed. Responses are written by the subject on either the test form it- 
self or on a separate answer sheet. In some paper-and-pencil tests, the stim- 
uli are presented on phonograph records or tape recordings. Examples in- 
clude tests of musical aptitude (Ch. 15) and tests designed to measure either 
aptitude or achievement in foreign languages (Ch. 17). 

In performance tests, the individual may be required to manipulate objects,
pictures, blocks, or mechanical apparatus, or he may perform more complex
activities in a typical everyday-life situation. Performance tests have
traditionally been restricted to individual testing. One reason for such a limitation
is the difficulty of providing duplicate sets of materials, owing to bulkiness,
purchase cost, and expense of checking and maintenance in the case of
certain types of equipment. Another reason is that in most performance tests
each subject could easily see what others are doing. In contemporary,
large-scale testing programs, however, performance tests are sometimes
administered in small groups. Table screens may be employed to conceal each
subject's test materials. In the case of many sensorimotor coordination tests,
moreover, such a precaution may be unnecessary, since observation of
another's performance would be of no help. Figure 4 illustrates the procedure
followed by the Air Force in administering apparatus tests for classification
purposes during World War II.

Fig. 4. The Administration of a Performance Test to a Small Group: Two-Hand
Coordination Test Used in Air Force Classification Program. (Courtesy U. S. Air Force;
for description of test, cf. Melton, 16.)


A testing medium that has been extensively explored since 1940 is the
motion-picture film. This medium received considerable attention, for
example, in the Air Force testing program during World War II (12). Further
research on the development of motion-picture tests was conducted, as part of
an extensive project on the instructional use of films, by Carpenter and his
associates under the joint auspices of the Army and Navy (5). In this
connection, special equipment was developed for the recording and immediate
scoring of subject responses during motion-picture tests. Figure 5 shows the
individual "response stations" to be used by each subject in the group being
tested (5). The subject inserts his hand under the cover and registers each
response by depressing the appropriate key.

Fig. 5. The Motion Picture as a Testing Medium.

Although presenting certain technical problems in the preparation of the
film and in the administration of the test (problems relating to seating
arrangements, lighting, and the like), motion-picture tests have many
advantages. If combined with a sound track, such tests can achieve a high degree of
standardization in instructions and presentation of stimuli. The time allowed
for each item can likewise be controlled. Moreover, certain perceptual
problems involving motion can be effectively presented through this medium. The
realism of the situations portrayed by a motion picture is a further asset in
certain types of tests.

Mention should also be made of what is undoubtedly the newest of testing 
mediums, television. Preliminary research suggests that this, too, is a promis- 
ing medium for highly standardized, large-scale testing programs (cf. 17). 
One advantage of televised tests is the speed with which they can be revised, 
as contrasted to films. In a rapidly changing field, television can thus combine 
timeliness with the high degree of standardization of procedure offered by 
the film medium. Figure 6 shows a group of subjects taking a televised test 
and recording their responses on specially devised response units. 

A further distinction made in the classification of psychological tests is 
that between language and non-language tests. In the non-language test, no 
language is required, either written or spoken, in either the instructions or the 
test items. To be sure, language may be and often is employed in
administering these tests, but alternative non-language procedures are available for
special circumstances. Non-language tests are especially designed for the
illiterate, foreign-speaking, deaf, or others who for any reason are unable to take a
language test. A non-language test can be either a performance or a
paper-and-pencil test. In the latter, the test content generally consists of pictures,
diagrams, and non-linguistic symbols, the subject being required to respond 
by making relatively simple marks. Instructions are given by gesture, pan- 
tomime, and demonstrations involving charts and diagrams. The Army Beta, 
developed during World War I, was the first group non-language test for 
measuring intelligence. 

Tests have also occasionally been classified with reference to their pre- 
dominant content, as verbal, numerical, pictorial, spatial, and the like. Such 
a superficial characterization is, however, gradually giving way to more precise 
descriptions in terms of the factorial composition of tests. Pictorial materials, 
for example, can be employed to measure verbal comprehension. Thus, in a number 
of tests for preschool and primary grade children, the subject is required to 
mark the picture that fits the word, phrase, or sentence spoken by the examiner. 
Another objection to classifying tests in terms of content stems from the fact 
that individuals may employ different methods to perform the same task. 
Consequently, a problem involving spatial content, such as geometric forms, may 
be solved by some subjects through spatial visualization and by others through 
verbal reasoning. 

Fig. 6. Television as a Testing Medium: Television Screens, Sound Equipment, and 
Individual Response Indicators in Use. (Cf. 17; courtesy R. T. Rock.) 

Finally, mention should be made of the differentiation between speed and 
power tests. A pure speed test is one in which individual differences depend 
entirely upon speed of performance. Such a test is ordinarily constructed from 
items of uniformly low difficulty level, all of which are well within the ability 
of the subjects taking the test. The time limit is then made so short that no one 
can finish all the items. Under such conditions, each person's score reflects 
only the speed with which he worked. A pure power test, on the other hand, 
has a time limit long enough to permit everyone to attempt all items. The 
difficulty of the items is steeply graded, and the test includes some items too 
difficult for anyone to solve, so that nobody can get a perfect score. It will be 
noted that both speed and power tests are designed to prevent the achieve- 
ment of perfect scores. The reason for such a precaution is that perfect 
scores are indeterminate, since it is impossible to know how much higher 
the individual's score would have been if more items, or more difficult items, 
had been included. To enable each individual to show fully what he is able 
to accomplish, the test must provide adequate ceiling, either in number of 
items or in difficulty level. 
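
The point about ceiling can be illustrated with a minimal sketch. The subjects, the ability values, and the scoring rule below are hypothetical assumptions introduced only for illustration; they are not taken from any actual test.

```python
# Hypothetical illustration of an inadequate test ceiling.
# Assumption: a subject's "ability" is the number of steeply graded items
# he could pass if the test were long enough. The observed score can never
# exceed the number of items actually included in the test.

def observed_score(ability: int, n_items: int) -> int:
    """Return the score on a test of n_items, capped at the test ceiling."""
    return min(ability, n_items)

# Hypothetical subjects and abilities (not real data).
abilities = {"A": 18, "B": 25, "C": 40}

# A 20-item test has too low a ceiling for subjects B and C.
short_test = {name: observed_score(a, 20) for name, a in abilities.items()}

# A 50-item test leaves room above every subject in this group.
long_test = {name: observed_score(a, 50) for name, a in abilities.items()}

print(short_test)  # B and C both obtain the perfect score of 20
print(long_test)   # B and C are now distinguished
```

On the shorter test, subjects B and C obtain the same perfect score although their assumed abilities differ; this is the sense in which a perfect score is indeterminate.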

The distinction between power and speed tests is one of degree, rather 
than being a twofold division. Most tests actually depend upon both power 
and speed, in varying proportions. It is important to know the extent to 
which speed and power enter into performance on any particular test. Not 
only is such information essential for the proper interpretation of scores on 
any given test, but it is also needed in the technical evaluation of the test. As 
will be seen in Chapter 5, the estimation of certain forms of test reliability 
may be seriously in error if the role of speed is ignored. 


SOURCES OF INFORMATION ABOUT TESTS 


The active test user in any branch of testing needs to keep informed 
regarding the new tests constantly being developed. The large number available, 
as well as the rapidity with which revisions and new tests appear, makes 
the task of locating pertinent material a particularly difficult one. Familiarity 
with at least the major sources of information is thus a necessity for anyone 
interested in tests. 

One of the most important sources is the series of Mental Measurements 
Yearbooks edited by Buros (4). These yearbooks cover nearly all commercially 
available psychological, educational, and vocational tests published in 
English-speaking countries. The coverage is especially complete for 
paper-and-pencil tests. Each yearbook includes tests published during a specified 
period, thereby supplementing rather than supplanting the earlier yearbooks. 
Thus, The Fifth Mental Measurements Yearbook, published in 1959, is concerned 
with tests appearing between 1952 and 1958. The first two publications in the 
series, appearing in 1935 and 1936, were simply bibliographies of tests. 
Beginning in 1938, however, the yearbook assumed its current form, which 
includes critical reviews of most of the tests by one or more test experts. Such 
data as publisher, price, forms, and age or other level of subjects for whom 
the test is suitable are regularly given. Another valuable feature of the 
yearbooks is their listing of books dealing with testing and related fields, 
together with excerpts from published reviews of each book. 

A bibliography on psychological tests, prepared by Hildreth in 1939 (13), 
covers a 50-year period and includes over four thousand titles. A supplement 
(14) published in 1945 contains about one thousand additional items. In 
this bibliography, an effort was made to list all available standardized tests, 
with the exception of sensory and physical measures. The entries are grouped 
under such broad headings as intelligence, vocational, psychomotor, and 
personality tests. Besides tests published in America, some foreign ones are 
also included, principally from England, France, and Germany. Tests not 
commercially available are likewise listed, provided they are adequately 
described in a printed source. Other compilations published at about the same 
time as the Hildreth bibliography are An Annotated Bibliography of Mental 
Tests and Scales by Wang (22) and An Index of Periodical Literature on 
Testing by South (19). 

Information regarding many tests can also be obtained from handbooks 
that survey tests for special purposes. An early classic is Whipple's Manual of 
Mental and Physical Tests, first published in 1910 (24). Detailed descriptions 
of many sensory, motor, and simple psychological tests are given in 
this source. A much more recent publication is the book by Super (21), 
which surveys psychological tests with special reference to their usefulness in 
vocational counseling. A handbook by Dorcus and Jones (9) is devoted to 
tests for which validation data in terms of an industrial criterion are available; 
this collection would thus be of special interest in connection with 
personnel selection. Surveys of tests have also been prepared within such 
specialized areas as mechanical aptitude, clerical aptitude, educational 
achievement, personality, and projective techniques. References to these sources 
will be cited in the appropriate chapters dealing with such tests. 

Attention should also be called to the listings and reviews of newly 
published tests provided by a number of psychological and educational journals. 
Psychological Abstracts lists new tests in a special section. Recently 
published tests are also listed in individual issues of Educational and 
Psychological Measurement, a journal containing many articles on the construction, 
use, and evaluation of tests. Short reviews of new tests appear from time to 
time in such publications as the Journal of Consulting Psychology. A 
comprehensive survey of all types of psychological and educational tests 
appears in the February issue of the Review of Educational Research. Initiated 
in 1932, this triennial cycle includes the years 1953, 1956, 1959, and so on. 
Since 1954, Personnel Psychology has included a section entitled "Validity 
Information Exchange," which reports in standard summary fashion any new data 
on the validity of specific tests. Still another section, on "Normative Data 
Information Exchange," was added in 1956. Current information on tests can 
also be found in the chapter on "Individual Differences" in each volume of the 
Annual Review of Psychology. 

Finally, it should be noted that the most direct source of information re- 
garding specific current tests is provided by the catalogues of test publishers 
and by the manual that accompanies each test. A comprehensive list of test 
publishers, with addresses, can be found in the latest Mental Measurements 
Yearbook. For ready reference, the names and addresses of some of the 
larger American publishers and distributors of psychological tests are given 
in the Appendix (p. 638). Catalogues of current tests can be obtained from 
these publishers on request. Manuals and specimen sets of tests can be 
purchased by qualified users (see Ch. 3). 

The test manual should provide the essential information required for 
administering, scoring, and evaluating a particular test (cf. 1, 7, 10, 18, 20). 
In it should be found full and detailed instructions, scoring key, norms, and 
data on reliability and validity. Moreover, the manual should report the 
number and nature of subjects on whom norms, reliability, and validity were 
established, the methods employed in computing indices of reliability and 
validity, and the specific criteria against which validity was checked. In the 
event that the necessary information is too lengthy to fit conveniently into the 
manual, references to the printed sources in which such information can be 
readily located should be given. The manual should, in other words, enable 
the test user to evaluate the test before choosing it for his specific purpose. 
It might be added that many test manuals still fall short of this goal. But 
some of the larger and more professionally oriented test publishers are 
giving increasing attention to the preparation of manuals that meet adequate 
scientific standards. An enlightened public of test users provides the firmest 
assurance that such standards will be maintained and improved in the future. 

A succinct but comprehensive guide for the evaluation of psychological 
tests is to be found in the Technical Recommendations for Psychological 
Tests and Diagnostic Techniques (1), prepared and officially adopted by 
the American Psychological Association. These recommendations represent 
a summary of desirable practices in test construction, based upon the current 
state of knowledge in the field. They are concerned with such questions as 
reliability, validity, the establishment of norms, and the preparation of a test 
manual. Relevant portions of these recommendations will be discussed 
elsewhere in the book, in connection with specific topics. 


13. Hildreth, Gertrude H. A bibliography of mental te. 

Jackson, R. W. B., 

Melton, A. W. (Ed.) Apparatus tests. 


CHAPTER 3 


Use of Psychological Tests 


"May I have a Stanford-Binet blank? I'd like to find my little sister's IQ. The 
family think she's precocious." 


“Last night I answered the questions in an intelligence test published in our news- 
paper, and I got an IQ of 80— I think psychological tests are silly.” 


"I'd like to borrow the Ishihara color-blindness test to show my brother. He's 
applying for a Navy commission and would like some practice so he can pass 
that test." 


"My roommate is studying psych. She gave me a personality test and I came out 
neurotic. I've been too upset to go to class ever since." 


"I represent the school paper. We'd like a list of the IQ's of the entering 
freshmen to publish in our first Fall issue." 


The above remarks are not imaginary. Each is based on a real incident, 
and the list could easily be extended by any psychologist. Such remarks 
illustrate potential misuses of psychological tests in such ways as to render the 
tests worthless or to hurt the individual. Like any scientific instrument or 
precision tool, psychological tests must be properly used to be effective. In 
the hands of either the unscrupulous or the well-meaning but uninformed 
user, such tests can cause serious damage. 


CODE OF PROFESSIONAL ETHICS PERTAINING 
TO PSYCHOLOGICAL TESTS 


In order to circumvent the misuse of psychological tests, it has become 
necessary to erect a number of safeguards around both the tests themselves 
and the test scores. The distribution and use of psychological tests 
constitutes a major area in Ethical Standards of Psychologists (3), the code of 
professional ethics officially adopted by the American Psychological 
Association. It will be helpful at this point to take a quick look at some of the 
highlights in the relevant portions of this code. Some of these points are also 
covered in Technical Recommendations for Psychological Tests and 
Diagnostic Techniques (2), cited in the preceding chapter. 

The first and most fundamental principle is that the sale and distribution 
of tests should be restricted to qualified users. The necessary qualifications 
will, of course, vary with the type of test. Thus, a relatively long period of 
intensive training and supervised experience is required for the proper use 
of individual intelligence tests and most personality tests, and a minimum of 
specialized psychological training is needed in the case of educational 
achievement or vocational proficiency tests. It should also be noted that 
students who take tests in class for instructional purposes are not usually 
equipped to administer the tests to others or to interpret the scores properly. 

Test scores should likewise be released only to persons qualified to 
interpret them. When an individual is given his own score, not only should the 
score be interpreted by a properly qualified person, but facilities should also 
be available for counseling any individual who may become emotionally 
disturbed by a knowledge of his score. For example, a college student might 
become seriously discouraged when he learns of his poor performance on a 
scholastic aptitude test. A gifted school child might develop habits of laziness 
and shiftlessness, or he might become uncooperative and unmanageable, if 
he discovers that he is much brighter than any of his associates. A severe 
personality disorder may be precipitated when a maladjusted individual is 
given his score on a personality test. Such detrimental effects may, of course, 
occur regardless of the correctness or incorrectness of the score itself. Even 
when a test has been accurately administered and scored, and properly 
interpreted, a knowledge of such a score without the opportunity to discuss it 
further may be detrimental to the individual. The possible harm is further 
compounded if the score itself is in error. 

A question arising particularly in connection with personality tests is that 
of "invasion of privacy." In so far as some tests of emotional, motivational, 
or attitudinal traits are necessarily disguised, the subject may reveal 
characteristics in the course of such a test without realizing that he is so doing. 
Although there are few available tests whose approach is subtle enough to 
fall into this category, the possibility of developing such indirect testing 
procedures imposes a grave responsibility upon the psychologist who uses 
them. For purposes of testing effectiveness it may be necessary to keep the 
examinee in ignorance of the specific ways in which his responses on any one 
test are to be interpreted. Nevertheless, a person should not be subjected to 
any testing program under false pretenses. Of primary importance in this 
connection is the obligation to have a clear understanding with the examinee 
regarding the use that will be made of his test results. Two statements 
contained in Ethical Standards of Psychologists are especially relevant to this 
problem (3, p. 280): 


The psychologist who asks that an individual reveal personal information in the 
course of interviewing, testing, or evaluation, or who allows such information to be 
divulged to him, does so only after making certain that the person is aware of the 
purpose of the interview, testing, or evaluation and of the ways in which the 
information may be used. 


The psychologist in industry, education, and other situations in which conflicts 
of interest may arise among varied parties, as between management and labor, 
defines for himself the nature and direction of his loyalties and responsibilities and 
keeps these parties informed of these commitments. 


Still other professional problems concern the marketing of psychological 
tests by authors and publishers. Tests should not be released prematurely for 
general use. Nor should any claims be made regarding the merits of a test 
in the absence of sufficient objective evidence. When a test is distributed 
early for research purposes only, this condition should be clearly specified 
and the distribution of the test restricted accordingly. As already indicated in 
the preceding chapter, the test manual should provide adequate data to 
permit an evaluation of the test itself, as well as full information regarding 
administration, scoring, and norms. The manual should be a factual exposition 
of what is known about the test, rather than a selling device designed to put 
the test in a favorable light. It is the responsibility of the test author and 
publisher to revise tests and norms often enough to prevent obsolescence. 
The rapidity with which a test becomes outdated will, of course, vary widely 
with the nature of the test. 

Finally, tests or parts of tests should not be published in a newspaper, 
popular magazine, or book, either for descriptive purposes or for 
self-evaluation. Under such conditions, self-evaluation would not only be subject to 
such drastic errors as to be well-nigh worthless, but it might also be 
detrimental to the individual for the reasons already discussed. Moreover, any 
publicity given to specific test items will tend to invalidate the future use of 
the test with other persons. It might also be added that presentation of test 
materials in this fashion tends to create an erroneous and distorted picture 
of psychological testing in general. Such publicity may foster either naive 
credulity or indiscriminate resistance on the part of the public toward all 
psychological testing.¹ 


¹ For teaching or expository purposes, it is permissible to reproduce sample test items 
constructed so as to resemble those of the test itself, or items which are used for demonstration 
purposes only in the test. This is the practice followed, for example, in the present text. 


PRINCIPAL REASONS FOR CONTROLLING THE 
USE OF PSYCHOLOGICAL TESTS 


It should be apparent that there are two principal reasons for controlling 
the use of psychological tests: (a) to prevent general familiarity with test 
content, which would invalidate the test; and (b) to insure that the test is 
used by a qualified examiner. Obviously, if an individual were to memorize 
the correct responses on a test of color blindness, such a test would no longer 
be a measure of color vision for him. Under these conditions, the test would 
be completely invalidated. Test content clearly has to be restricted in order 
to forestall deliberate efforts to fake scores. 

In other cases, however, the effect of familiarity may be less obvious, or 
the test may be invalidated in good faith by misinformed persons. A school 
teacher, for example, may give her class special practice in problems closely 
resembling those on an intelligence test, "so that the pupils will be well pre- 
pared to take the test." Such an attitude is simply a carry-over from the 
usual procedure of preparing for a school examination. When applied to an 
intelligence test, however, it is likely that such specific training or coaching 
will raise the scores on the test without appreciably affecting the broader 
area of behavior the test tries to sample. Under such conditions, the validity 
of the test as a predictive instrument is reduced. 

The need for a qualified examiner is evident in each of the three major 
aspects of the testing situation—selection of the test, administration and 
scoring, and interpretation of scores. Tests cannot be chosen like lawn 
mowers, from a mail-order catalogue. They cannot be evaluated by name, au- 
thor, or other easy marks of identification. To be sure, it requires no psy- 
chological training to consider such factors as cost, bulkiness and ease of 
transporting test materials, testing time required, ease and rapidity of scoring, 
and the like. Information on these practical points can usually be obtained 
from a test catalogue and should be taken into account in planning a testing 
program. For the test to serve its function, however, an evaluation of its 
technical merits in terms of such characteristics as validity, reliability, and 
norms is essential. Only in such a way can the test user determine the 
appropriateness of any test for his particular purpose and its suitability for the 
type of persons with whom he plans to use it. 

The introductory discussion of test standardization in Chapter 2 has 
already suggested the importance of a trained examiner. An adequate 
realization of the need to follow instructions precisely, as well as a thorough 
familiarity with the standard instructions, is required if the test scores obtained by 
different examiners are to be comparable, or if any one individual's score is 
to be evaluated in terms of the published norms. Careful control of testing 
conditions and effectiveness in establishing rapport are also essential. 
Similarly, incorrect or inaccurate scoring may render the test score worthless. In 
the absence of proper checking procedures, scoring errors are far more likely 
to occur than is generally realized. 

The proper interpretation of test scores requires a thorough understanding 
of the test, the individual, and the testing conditions. What is being measured 
can be objectively determined only by reference to the specific procedures 
in terms of which the particular test was validated. Other information, 
pertaining to reliability, nature of the group on which norms were established, 
and the like, is likewise relevant. Some background data regarding the 
individual being tested are essential in interpreting any test score. The same 
score may be obtained by different persons for very different reasons. The 
conclusions to be drawn from such scores would therefore be quite dissimilar. 
Finally, some consideration must also be given to special factors that may 
have influenced a particular score, such as unusual testing conditions, 
temporary emotional or physical state of the subject, recent experiences, response 
sets, and "test sophistication" or extent of the subject's previous experience 
with tests. 

Some of the special problems associated with each of the major aspects of 
test administration and interpretation cited above will be considered in the 
sections that follow. 


MOTIVATION, TEST ANXIETY, AND RAPPORT 


Motivation. Underlying all tests of ability is the assumption that the subject 
is "doing his best." Consequently, if conditions are to be kept uniform in this 
regard, every subject should be motivated to put forth his maximum efforts 
on the test. A fairly large number of studies have been concerned with the 
possible influence of different incentives upon test performance. Almost any 
added incentive, however mild, may raise or lower the scores of at least 
certain groups. In one pair of studies, for example, praise and reproof were 
both found to improve the performance of school children on group 
intelligence tests, as well as on arithmetic tests (40, 41). Praise, however, proved 
to be more effective than reproof, especially when the incentives were 
administered repeatedly. Similar verbal incentives, including encouragement 
and discouragement, sarcasm, ridicule, and "razzing," have been used by 
other investigators in the effort to influence the subject's self-confidence and 
to arouse feelings of success or failure. Among the other incentives studied 
may be mentioned individual competition, group rivalry, knowledge of 
results, presence of observers, presence of co-workers, prizes, and monetary 
rewards. 

With emotionally disruptive conditions, scores are likely to drop even for 
subjects accustomed to taking tests. In one study (30), for example, eighth- 
grade pupils were given the Stanford-Binet and retested two weeks later 
under conditions designed to evoke discouragement. On this retest, the chil- 
dren scored significantly lower than a control group retested under normal 
conditions. Similar results were obtained when three groups of college stu- 
dents were given tests of verbal analogies, arithmetic, and cancellation under 
the following conditions: (a) normal, control conditions; (b) with instruc- 
tions to work as accurately as possible; (c) with instructions to work as 
rapidly as possible and with a buzzer sounding every 30 seconds, at which 
time the examiner stated how many items should have been completed (66). 
The third set of conditions was designed to arouse tension. It also tended to 
induce feelings of inferiority and failure, since the level of attainment speci- 
fied at each interval was set just beyond the capacities of the subjects. The 
results of this study showed that the group working under "tension" condi- 
tions made significantly more errors than the other two groups. The instruc- 
tions to be accurate produced no appreciable effect. 

In the administration of such tests as the Stanford-Binet, the standard 
procedure is to continue with the presentation of increasingly difficult tasks 
until all tests within a single year level are failed. It has been suggested that 
with some children such a procedure may produce a mounting awareness of 
failure, which prevents the child from doing as well as he might on some of 
the later tasks (42). To test this hypothesis, one investigator altered the 
procedure in such a way that every failure was followed by an easier 
item (42). Poorly adjusted children were found to score higher when tested 
by this method than by the standard procedure, but well-adjusted children 
did equally well by either method. It should be noted, of course, that stand- 
ardized test procedure should not be altered in this manner under ordinary 
circumstances, when it is desired to evaluate an individual's performance in 
terms of norms. The study is cited merely to illustrate the part motivational 
factors may play in the test performance of maladjusted subjects. 

Another investigation of interest in this connection was conducted on 
kindergarten children (49). The subjects were given the Stanford-Binet upon 
entering kindergarten and were retested two months later with a parallel 
form. A significant rise in mean IQ was found on the retest. This gain was 
attributed by the investigator largely to the effect of the kindergarten experi- 
ence in reducing shyness, fear of strangers, and other attitudes inhibiting oral 
expression. Support for such a hypothesis was found in the fact that the 
average test-retest improvement in manipulatory tasks was only 4.7 per cent, 
as contrasted to a gain of 11.2 per cent in the oral items. In certain ethnic 
minority groups, such as the American Negro, attitudes that inhibit oral 
speech may be fostered by the culture, rather than gradually eliminated. It 
has been suggested that overt verbalization increases the possibility of incur- 
ring the hostility of the dominant social group. Consequently, habits of in- 
articulateness might be encouraged by cultural factors in such minority 
groups (11). Verbalization, of course, plays as important a part in the in- 
dividual's general intellectual development, as in his test performance. Hence 
more than just test scores would be adversely affected by such a culturally 
imposed handicap. 

Most middle-class American school children and college students today 
are not only fairly test-wise, but they are also generally motivated to suc- 
ceed in academic work and in test situations. In such groups, cooperation 
can be obtained with little difficulty. Special motivational problems are en- 
countered, however, in testing certain other groups. Emotionally disturbed 
persons, prisoners, or juvenile delinquents, especially when tested in an in- 
stitutional setting, are likely to manifest a number of unfavorable attitudes, 
such as suspicion, insecurity, fear, or cynical indifference. Specific abnormal 
factors in the past experience of such persons are also likely to influence 
their test performance adversely. Such individuals may, for example, have 
developed feelings of hostility and inferiority toward any academic material, 
as a result of early failures and frustrations in school (cf. 62, Chs. 4 and 6; 
64). 

There is also ample evidence to suggest that test-taking motivation varies 
widely in different ethnic and socioeconomic groups (cf. 5, p. 552; 23, 
pp. 20-21). One illustration is provided by the following statement, appear- 
ing in a summary of socioeconomic differences in test performance: 


Observation of the performance of lower-class children on speed tests leads one 
to suspect that such children often work very rapidly through a test, making re- 
sponses more or less at random. Apparently they are convinced in advance that 
they cannot do well on the test, and they find that by getting through the test 
rapidly they can shorten the period of discomfort which it produces (23, p. 21). 


It is interesting to note that a reaction almost identical with that described 
above was observed among Puerto Rican school children tested in New 
York City (6) and in Hawaii (65). 


Test Anxiety. Closely related to test-taking motivation is the question of 
test anxiety. The nature, correlates, and effects of such anxiety have been 
studied with both school children and college students, much of this research 
having been conducted by Sarason and his associates at Yale (50, 62, 63, 
77). First, a questionnaire was constructed to assess the individual's 
test-taking attitudes. The children's form, for example, contains 43 items such 
as the following: 


Do you worry a lot before taking a test? 


When the teacher says she is going to find out how much you have learned, does 
your heart begin to beat faster? 


While you are taking a test, do you usually think you are not doing well? 


Children were also rated by their teachers for overt expressions of anxiety, 
including such behavior as fidgeting when called upon to recite, working 
better alone than in front of a class, and performing poorly under time 
pressure. 

In one investigation on 600 children in grades two to five, anxiety ques- 
tionnaires and teachers' ratings were significantly correlated in all grades 
(63). Test anxiety also tended to increase from the second to the fifth grade. 
Of primary interest is the finding that both school achievement and intelli- 
gence test scores yielded significant negative correlations with test anxiety. 
Such correlations support the hypothesis that children who become over- 
anxious in a test situation are thereby handicapped in their performance. 
Such correlations, of course, do not indicate the direction of causal relation. 
It is possible that children develop test anxiety because they do poorly on 
tests and have thus experienced failure and frustration in previous test situa- 
tions. 

Sarason and his co-workers, however, point to several lines of evidence 
suggesting that at least some of the association results from the deleterious 
effects of anxiety upon test performance. In one investigation (77), high- 
anxious and low-anxious children equated in intelligence test scores were 
given repeated trials in a learning task. Although initially equal in the learn- 
ing test, the low-anxious group improved significantly more than the high- 
anxious. Supporting evidence is also provided by a series of learning 
experiments on college students (50). For example, ego-involving instructions, 
such as telling subjects that everyone is expected to finish in the time al- 
lotted, had a beneficial effect on the performance of low-anxious subjects, 
but a deleterious effect on that of high-anxious subjects. 


It thus appears that test anxiety does interfere with effective learning and 
test performance. More research is needed, however, before a definitive 
statement can be made. It is likely, for instance, that the relation between 
anxiety and performance is curvilinear, a slight degree of anxiety being 
beneficial, a high degree detrimental. The finding that ego-involving 
instructions exert a positive influence on the performance of initially low-anxious 
subjects and a negative influence on the performance of initially high-anxious 
subjects fits in with this hypothesis. 
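
What a curvilinear relation of this kind would mean for correlational data can be shown in a small sketch. The anxiety scale and the performance values below are hypothetical assumptions, constructed as a symmetrical inverted U solely for illustration; they are not findings from the studies cited.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical anxiety levels from 0 (none) to 10 (extreme).
anxiety = list(range(0, 11))

# Hypothetical performance: rises to a peak at moderate anxiety (level 5),
# then falls off symmetrically. This shape is an assumption, not data.
performance = [50 - (a - 5) ** 2 for a in anxiety]

r_all = pearson_r(anxiety, performance)            # whole range
r_low = pearson_r(anxiety[:6], performance[:6])    # levels 0 to 5 only
r_high = pearson_r(anxiety[5:], performance[5:])   # levels 5 to 10 only

print(round(r_all, 3), round(r_low, 3), round(r_high, 3))
```

Under these assumed values the correlation is positive in the lower half of the anxiety range and negative in the upper half, while over the whole range the two trends cancel. A single linear index computed on a mixed group could therefore conceal a relation of this form.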

What, specifically, do the experimental findings on motivation and test 
anxiety imply regarding testing procedure? First, such findings highlight the 
importance of adhering to the prescribed motivating conditions in administer- 
ing any test. The addition of any other incentive, however mild, may raise or 
lower scores appreciably, especially with certain types of subjects. Standing
over the subject with a stop watch, peering over his shoulder, uttering some
word of exhortation or criticism, or telling the subject how much time
remains all illustrate ways in which motivating conditions might be
inadvertently altered. A second implication is that, in the interpretation of
scores, any unusual motivating conditions should be taken into consideration.
This is especially true for subjects whose experiential background is unlike
that of the standardization sample. Finally, it is apparent that the
establishment of rapport, prior to the administration of a test, is an
important part of the testing procedure. In so far as the situation permits,
the examiner must make certain that the subject is ready to do his best
before the test is begun.

Rapport. The specific techniques for establishing rapport vary somewhat
with the nature of the test and the type of subjects to be tested. Thus, in
testing preschool children,² special factors to be considered include shyness
with strangers, distractibility, and negativism. A friendly, cheerful, and
relaxed manner on the part of the examiner helps to reassure the child. The
shy, timid child needs more preliminary time to become familiar with his
surroundings. For this reason it is better for the examiner not to be too
demonstrative at the outset, but rather to wait until the child is ready to
make the first contact. Test periods should be brief, and the tasks should be
varied and intrinsically interesting to the child. The testing should be
presented to the child as a game, and his curiosity aroused before each new
task is introduced. A certain flexibility of procedure is necessary at this
age level, because of possible refusals, loss of interest, and other
manifestations of negativism.

Children ... ary school present ... d. The "game" ap- ...


2 A detailed description of recommended procedures for testing young children
can be found in 29. This account has been reprinted by Goodenough (28, pp.
298-304).


older school child can usually be motivated through an appeal to his
competitive spirit and his desire to do well on a test. It should be borne in
mind, however, that every test presents an implied threat to the individual's 
prestige. Some reassurance should therefore be given at the outset. It is help- 
ful to explain, for example, that no one is expected to finish or to get all the 
items correct. The individual might otherwise experience a mounting sense
of failure as he finds that he is unable to finish any part of the test within the 
time allowed. 

It is also desirable to eliminate the element of surprise from the test
situation as far as possible, since the unexpected and unknown are likely to
produce anxiety. Many group tests provide a preliminary explanatory statement
which is read to the group by the examiner. An even better procedure is to
announce the tests a few days in advance and to give each subject a printed
statement that explains the purpose and nature of the tests, offers general
suggestions on how to take the tests, and possibly contains a few sample
items. Such printed statements are used regularly by the College Entrance
Examination Board (14), Educational Records Bureau (47, pp. 345-347), and
other organizations. A more general booklet entitled Taking a Test (51),
published by World Book Company, is designed for use with senior high school
and college students.

In the absence of any formal printed statement, it can probably be assumed
that any announcement is better than none. Care should be taken, of course,
not to arouse anxiety by making the coming test sound like a formidable
event. If such announcements are made in an objective, straightforward,
and matter-of-fact manner, however, they will serve to reduce rather than to 
heighten tension and worry. In a group of junior high school students, for 
example, increases of approximately one to two per cent in mean score were 
found as a result of a two-day advance notice of the test (75). 

The testing of college students and adults presents some of the same prob- 
lems as the testing of school children, including the need for reducing threat 
and surprise. In addition, adults out of school are generally more resistant
to tests. Unlike the school child, the adult is not so likely to work hard at
a task merely because it is assigned to him. It therefore becomes more
important to “sell” the purpose of the tests to the adult, although high
school and college students also respond to such an appeal. Cooperation of
the subject can usually be secured by convincing him that it is in his own
interests to obtain a valid score, i.e., a score correctly indicating what he
can do, rather than overestimating or underestimating his abilities. Most
subjects can readily understand that an incorrect decision, which might
result from invalid test scores, would mean subsequent failure, loss of time,
and frustration for them. This approach is usually effective not only in
motivating
the subject to try his best, but also in preventing cheating, since the subject 
realizes that he himself would eventually be the loser. It is certainly not in 
the best interests of the individual to be admitted to a course of study for 
which he is not qualified or assigned to a job he cannot perform. 


INFLUENCE OF PRACTICE AND COACHING 
UPON TEST PERFORMANCE 


Breadth of Influence. In evaluating the effect of coaching or practice upon 
test scores, a fundamental question to consider is the breadth of such in- 
fluence. Is the improvement limited to the specific items included in the test, 
or does it extend to the broader area of behavior that the test is designed to 
predict? The answer to this question represents the difference between coach- 
ing and education. Obviously any educational experience the individual un- 
dergoes, either formal or informal, in or out of school, should be reflected 
in his performance on tests sampling the relevant aspects of behavior. Such 
broad influences would in no way invalidate the test, since the test score 
would in such cases present an accurate picture of the individual's standing 
in the abilities under consideration. The difference is, of course, one of 
degree. An influence is not either narrow or broad, but obviously varies 
widely in scope, from factors affecting only a single administration of a single 
test, through those affecting performance on all items of a certain type, to
those influencing the individual's performance in the large majority of his
activities. From the standpoint of effective testing, however, a workable
distinction can be made. Thus it can be stated that a test score is
invalidated only when a particular experience raises it without appreciably
affecting the behavior domain that the test is designed to predict.³ It is
the latter, invalidating type of influence that will now be considered with
reference to coaching and practice.

Coaching. A number of investigations have dealt with the effects of coaching
upon test performance. Early studies with the Stanford-Binet demonstrated
that children can be taught intelligence test items they were formerly
incapable of executing correctly (12, 33). Large and significant gains in IQ
were obtained in one experiment as a result of two hours of coaching on
tests the child had failed on a previous administration of the
Stanford-Binet (33). Groups coached on similar rather than on identical
material showed smaller gains. The effects of coaching declined on successive
retests. At the end of three years, no significant differences remained
between the groups coached on identical and on similar material or between
these and the control group which had been retested with no intervening
coaching. Such a result is to be expected, partly because of forgetting and
partly because the nature of the Stanford-Binet items varies at different
age levels. The children were therefore being tested on tasks unlike those on
which they had been coached.

3 For a fuller discussion of this point, cf. 4,

More recent research on several group as well as individual tests has 
likewise shown that, in general, coaching produces significant gains in mean 
scores (20, 24, 37, 43, 76, 83, 84, 85). Many of these studies were
conducted by British psychologists, who have been concerned about the effects
of practice and coaching upon the tests used in assigning 11-year-old
children to different types of secondary schools. As might be expected, the
extent of improvement depends upon the ability and earlier educational
experiences of the subjects, the nature of the tests, and the amount and type
of coaching provided. Subjects with deficient educational backgrounds are
more likely to benefit from special coaching than are those who have had
superior educational opportunities and are already prepared to do well on
the tests. It is obvious, too, that the closer the resemblance between test
content and coaching material, the greater will be the improvement in test
scores.

In America, the College Entrance Examination Board has felt concern 
over the prevalence of ill-advised coaching courses for college applicants.
To clarify the issue, the College Board conducted several well-controlled
experiments to determine the effects of coaching on its Scholastic Aptitude
Test (24). In a formal statement subsequently issued by the College Board
trustees, it was pointed out that, although intensive drill on the types of
material covered by this test may produce a significant mean rise in score
in certain groups, the amount of probable gain in individual cases is not
such as to affect college admission decisions (15).

The distinction between coaching and education is highlighted by an
investigation with kindergarten children (39). Two kindergarten classes
totaling 53 children were put through a 14-week program based on the Learning
to Think series (70, 71). Prepared by the author of the tests of Primary
Mental Abilities for Ages 5 to 7 (see Ch. 13), this training material is
closely similar to the test content. Before and after the training program,
the children were given both the Primary Mental Abilities tests and the
Wechsler Intelligence Scale for Children (see Ch. 12). Two control classes
of 54 children took the same pretests and posttests, with no intervening
training. All groups improved on the second testing. The trained subjects,
however, improved no more than the controls on the Wechsler test, although
they improved significantly more than the controls on the Primary Mental
Abilities tests. Such findings suggest that the training provided by the
Learning to Think materials operates as specific coaching on the Primary
Mental Abilities tests rather than serving a broader educational function.

Practice. The effects of sheer repetition, or practice, upon test performance
are similar to the effects of coaching, but usually less pronounced. It
should be noted that practice, as well as coaching, may alter the nature of
the test, since the subjects may employ entirely different work methods in
solving the same problems. In general, those tests in which work methods
change little with repetition show little improvement in score as a result of
practice; those in which performance undergoes marked qualitative changes
with repetition show large gains in score. Some evidence for this
relationship may be found in a study on college students in which the
objective test scores were supplemented with qualitative observations of
performance and with introspective reports of the methods employed in solving
problems (32). In this study, tests measuring speed of simple movements and
tests of auditory discrimination showed little or no practice effect. In
taking such tests the subjects performed essentially the same functions on
initial and later trials. Tests involving precision of movement and those
depending upon prior information, such as vocabulary tests, yielded retest
gains ranging from 6 to 25 per cent of the initial scores. Maze and
block-design tests, in which a generalized rule could be formulated during
the initial test, showed increases of from 76 to 200 per cent. Even greater
improvements were found in mechanical aptitude tests in which a mechanical
object had to be assembled from its constituent parts. In such tests, the
earlier solutions could be recalled and reapplied to the same test materials.


Current intelligence tests frequently contain items whose nature can be
expected to change with repetition. Retest scores on such tests, whether
derived from a repetition of the identical test or from a parallel form,
should therefore be carefully scrutinized. A number of studies have been
concerned with the effects of the identical repetition of intelligence tests
over periods ranging from a few days to several years (1, 13, 16, 19, 35, 36,
45, 79). Both adults and children, and both normal and mentally defective
subjects have been employed. Most of the studies h... some data on
individu... when 3500 school children were retested annually with a variety
of intelligence tests (19). When the same test was re-administered in
successive
years, the median IQ of the group rose from 102 to 113, but it dropped to 
104 when another test was substituted. Because of the retest gains, the 
meaning of an IQ obtained on an initial and later trial proved to be quite
different. For example, an IQ of 100 fell approximately at the average of
the distribution on the initial trial, but in the lowest quarter on a retest
(19, p. 134). Such IQ's, though numerically identical and derived from the
same test, might thus signify normal ability in the one instance and inferior
ability in the other.
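
The shift in meaning can be illustrated with a rough calculation. The sketch
below assumes, purely for illustration, that IQ's are normally distributed
with a standard deviation of 16 points on both occasions; neither assumption
is taken from the study cited, and the function name is hypothetical. Only
the two group medians (102 and 113) come from the text.

```python
from statistics import NormalDist

ASSUMED_SD = 16.0  # assumption: the spread of scores is not given in the text


def proportion_below(iq, group_median, sd=ASSUMED_SD):
    """Proportion of the group scoring below `iq`, under a normal model."""
    return NormalDist(mu=group_median, sigma=sd).cdf(iq)


# Medians reported in the text: 102 on the initial trial, 113 after retests.
initial = proportion_below(100, group_median=102)
retest = proportion_below(100, group_median=113)

print(f"initial trial: {initial:.0%} of the group falls below IQ 100")
print(f"retest:        {retest:.0%} of the group falls below IQ 100")
```

Under these assumptions an IQ of 100 lies only a little below the middle of
the initial distribution but within the lowest quarter of the retest
distribution, which agrees in direction with the finding reported above.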

Gains in score are also found upon retesting with parallel forms of the 
same test, although such gains tend in general to be smaller. Significant
mean gains have been reported when alternate forms of a test were
administered in immediate succession (68), at one-day intervals (68), and one
month apart (55, 56). Similar results have been obtained with British
children (55, 56), normal and intellectually gifted American school children
(68), and American high-school, college, and graduate students (68). Con- 
temporary test constructors recognize such a practice effect and often make 
allowances for it. In the Minnesota Preschool Scale, for example, it is sug- 
gested that 3 IQ points be deducted as a correction for practice effect when 
alternate forms are administered within a few weeks (29). 
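
The arithmetic of such an allowance is simple, and a minimal sketch may make
it concrete. The function name is hypothetical, and the four-week cutoff is
an assumption standing in for "a few weeks," which the text does not define;
only the 3-point deduction comes from the passage above.

```python
def practice_corrected_iq(obtained_iq, weeks_since_first_form,
                          correction_points=3, max_weeks=4):
    """Deduct the practice allowance when the alternate form follows closely.

    `max_weeks` is an assumed reading of "within a few weeks".
    """
    if weeks_since_first_form <= max_weeks:
        return obtained_iq - correction_points
    return obtained_iq


# Alternate form given two weeks after the first: allowance is deducted.
print(practice_corrected_iq(108, weeks_since_first_form=2))
# Alternate form given half a year later: score is left unchanged.
print(practice_corrected_iq(108, weeks_since_first_form=26))
```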

The general problem of test sophistication should also be considered in
this connection. The individual who has had extensive prior experience in
taking psychological tests enjoys a certain advantage in test performance
over one who is taking his first test (36, 60). Part of this advantage stems
from having overcome an initial feeling of strangeness, as well as from
having developed more self-confidence and better test-taking attitudes. Part
is the result of a certain amount of overlap in the type of content and
functions covered by many tests. Probably other factors also operate in more
subtle and indirect ways. It is particularly important to take test
sophistication into account when comparing results from children in different
types of schools, where the extent of psychological testing may vary widely.


MALINGERING AND CHEATING 


The problem of malingering and cheating is largely a motivational one. 
Theoretically, it should be possible to forestall attempts to fake scores on
psychological tests by convincing the subject that a valid score is in his
own best interests, on the grounds that an incorrect decision based upon
invalid scores will only cause him difficulties later on. With certain types
of subjects, however, such an appeal may not be very effective. School
children, for example, may not be sufficiently farsighted to be influenced by
a relatively distant goal. Maladjusted persons will in many cases respond too
emotionally
to the test to be susceptible to such a rational argument. There are, more- 
over, a few testing situations, especially in the examination of criminal of- 
fenders and in the screening of draftees for military service, in which the 
objectives of individual subjects may be fundamentally at cross-purposes to 
those of the examiner, a conflict which no amount of rapport may be able 
to reconcile. To be sure, proper motivation and rapport should go far to- 
ward reducing malingering and cheating. But other precautionary measures 
are undoubtedly necessary. When strong motivation to achieve a certain re- 
sult is combined with a feeling of insecurity regarding the outcome of a 
given test, attempts to fake scores are likely to occur. 

The faking of psychological test scores may be conveniently considered 
under three headings, viz., (a) cheating to raise scores on ability tests, (b) 
attempts to appear in a more favorable light on personality tests, and (c) 
simulation of inferiority or abnormality on either personality or ability tests. 

Raising Scores on Ability Tests. A common example of the first type of 
cheating is the utilization of outside assistance, such as referring to books 
and notes, or copying from a neighbor. Several measures may be taken to 
reduce such practices. As is true of all types of faking, however, no
preventive is completely successful. Adequate proctoring, proper seating
arrangements, and the simultaneous use of alternate test forms are generally
effective. If cheating is suspected during or after the test, certain
statistical checks ... These checks are based upon an analysis ... among the
responses of the subjects suspected of copying from each other (7, 10, 21).


Cheating to improve the score on an ability test may also take the form of
an unwarranted extension of time limit. This can occur through “jumping the
gun,” or premature starting, before the signal to begin. Thus the subject may
start to work on the first few items of the test proper while the examiner is
still giving the directions or discussing the sample items. Similarly, the
subject may continue to work after the signal to stop is given. The best
safeguard against such cheating is provided by more careful test
administration and proctoring. Premature starting can also be prevented
through the arrangement of items in the test booklets. All instruction
materials and sample items, as well as questions regarding name and other
personal data, should be printed on a separate page, containing none of the
items of the test proper. Moreover, the items should be so arranged that it
is necessary to turn a page when proceeding from the preliminary material to
the test proper, or when going from one separately timed part of the test to
the next.

Still another spurious procedure for raising scores on ability tests is the 
acquisition of prior knowledge regarding test items. In an institution for 
defective delinquents, for example, it was discovered that subjects were 
giving the "right answers" on a common individual intelligence scale
(Wechsler-Bellevue) even when an essential part of the question was omitted
(8). Thus to the question, "How many oranges can you buy for thirty-six
cents?" the subject would reply "Nine," although the examiner had omitted the
second part of the problem which normally follows, “if one orange costs
four cents.” In this case, the items were being circulated in the institution
through the help of those who had been tested earlier. It might be added
that this large-scale cheating program became organized as a result of a
misapprehension regarding dependence of parole on intelligence test scores;
proper orientation regarding the purposes of the test would probably have
done much to prevent such cheating. More specific procedures for avoiding
the dissemination of prior knowledge include the safeguarding of test
materials, the use of a test with several alternate forms, and scheduling the
testing so as to minimize the possibility of intercommunication. When
communication is likely, as in an institutional setting, a group test has
certain advantages over an individual test. For the same reason, the
administration of such a test in one large group would be preferable to the
successive testing of several smaller groups.

"Faking Good" on Personality Tests. The second major type of faking,
namely, the faking of personality test responses so as to appear in a more
favorable light, is especially likely to occur on self-report inventories.
Despite introductory statements to the contrary, most items in such
inventories have one answer which is pretty clearly the desirable or socially
acceptable response. Consequently there is a strong tendency for the
individual to check what he recognizes as the "right" answer, rather than the
answer which corresponds to his own habitual behavior. This may occur even
when there is no deliberate or recognized attempt to alter the score. If, in
addition, the individual is motivated to "put his best foot forward,” as in
the case of a job applicant, it is quite easy for him to create the desired
impression on such a test.

Evidence of the success with which subjects can dissemble on personality 
inventories is plentiful (cf. 25, 31, 54, 81). A common classroom
demonstration consists in asking different groups to fake responses in
specified ways. For example, one section of the class is directed to answer
each question as it would be answered by a happy and well-adjusted college
student; another section is told to respond in the manner of a severely
maladjusted subject; and the last section is instructed to answer the items
truthfully with reference to their own behavior. Or the same subjects may
take the test twice, first with instructions to simulate in a specified way
and later under ordinary conditions. The results of such studies clearly
demonstrate the facility with which the desired impression can be
deliberately created on such inventories. To be sure, subjects of lower
educational or intellectual level are probably
less successful in disguising their responses than are the college groups on 
which most of these studies have been conducted. As long as a subject has 
sufficient education to enable him to answer a personality inventory, how- 
ever, he probably has the ability to alter his score appreciably in the de- 
sired direction. 

It is interesting to note that specific faking for a particular vocational
objective can also be successfully carried out. Thus in one study (81), the
responses of the same group of students were compared on two administrations
of a personality inventory (Bernreuter) taken a week apart. On the first
testing, the subjects were instructed to pretend they were applying for the
position of salesman in a large industrial organization and to answer in a
manner designed to increase their chances of employment. On the second
testing, the same instructions were given, but the position of librarian was
substituted for that of salesman. When the responses were scored for the
trait of self-confidence, a conspicuous difference was found in the
distributions of scores on the two occasions, the simulated-salesman scores
being much higher than the corresponding librarian scores.

That job applicants do in fact fake personality test responses was
demonstrated in another study (31), in which the scores obtained by a group
of applicants were compared with the scores of a comparable group of job
holders who were tested for research purposes only. Under these contrasting
motivating conditions, the scores of the two groups differed in the expected
direction. Although such faking is relatively easy on the self-report type of
personality test, there is some evidence to indicate that it can also occur
in certain widely used projective techniques (cf., e.g., 80).

"Faking Bad" on Personality or Ability Tests. The last form of faking to
be considered involves the simulation of mental deficiency or emotional
disturbance by deliberately obtaining a "poor" score on an intelligence or
personality test. Such malingering is of special concern in military testing.
A variant of this performance is the differential failure on certain portions
of a classification battery in the effort to be assigned to a more desirable
specialty. For example, Air Force cadets sometimes tried deliberately to miss
items on the bombardier and navigator parts of the classification battery, in
the hope that such a performance would increase their chances of being
assigned to
pilot training. 

The problems presented by such malingering on aptitude tests are similar 
to those arising from either favorable or unfavorable faking on personality 
tests. The same general approaches are therefore being explored in the effort 
to cope with all these forms of simulation. One solution is to disguise the 
purpose of the test, as when a measure of masculinity-femininity is
presented as a survey of interests and attitudes. Such a procedure is
effective only in a few types of tests. Most personality test items have
enough “visibility” for the alert subject to discover the true purpose of the
test. A more effective sort of disguise is to be found in the forced-choice
technique, in which the subject must choose one of two answers which appear
to be equally acceptable or unacceptable. One of the answers, however, has
been shown empirically to be a favorable indicator of the criterion under
consideration, while the other answer is neutral or unfavorable.⁴

Another common procedure is directed not toward the prevention but toward
the detection of faking. Essentially, this technique involves the
construction of a special key from which a malingering score can be derived.
Such a procedure is applicable to both aptitude and personality tests. The
fundamental fact upon which the malingering keys are based is that when an
individual tries to fake his responses, he tends to overdo it. For example,
in an intelligence test on which the individual may be trying to appear dull,
he will probably pass more of the difficult items and fail more of the easy
items than a genuine mental defective. In such a situation it is difficult
for the individual to gauge accurately the relative difficulty of different
items. A malingering key can be empirically derived by comparing the
responses of a group of subjects instructed to appear stupid with the
responses of bona fide feebleminded cases (cf. 26, 57).
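
The logic of such an empirically derived key can be sketched in a few lines.
What follows is an illustration only, not the procedure of the studies cited:
the response data are invented, the function names are hypothetical, and the
0.30 cutoff for the difference in pass rates is an arbitrary assumption.

```python
def pass_rates(responses):
    """Proportion passing each item; `responses` is a list of 0/1 rows."""
    n_subjects = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_subjects
            for i in range(n_items)]


def derive_key(simulators, genuine, min_difference=0.30):
    """Map item index -> the response (1 = pass, 0 = fail) typical of simulators.

    An item is keyed only if the two groups' pass rates differ by at least
    `min_difference` (an assumed, arbitrary cutoff).
    """
    key = {}
    rates = zip(pass_rates(simulators), pass_rates(genuine))
    for item, (p_sim, p_gen) in enumerate(rates):
        if abs(p_sim - p_gen) >= min_difference:
            key[item] = 1 if p_sim > p_gen else 0
    return key


def malingering_score(row, key):
    """Number of keyed items answered in the direction typical of simulators."""
    return sum(1 for item, keyed in key.items() if row[item] == keyed)


# Invented data: items 0-1 are easy, items 2-3 difficult, item 4 intermediate.
genuine = [[1, 1, 0, 0, 1], [1, 1, 0, 0, 0], [1, 0, 0, 0, 1], [1, 1, 0, 0, 0]]
simulators = [[0, 1, 1, 0, 1], [0, 0, 0, 1, 0], [1, 0, 1, 0, 1], [0, 0, 1, 1, 0]]

key = derive_key(simulators, genuine)
print(key)
print(malingering_score([0, 0, 1, 1, 0], key))
print(malingering_score([1, 1, 0, 0, 1], key))
```

In this invented example the key picks out failures on the easy items and
passes on the difficult ones, which is the pattern of "overdoing it"
described above. An item on which the two groups do not differ contributes
nothing to the malingering score.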

Similar keys have been prepared for use with certain personality tests 
(e.g., 34). These will be considered in connection with the specific tests to 
be discussed in Chapter 18. In some instances, the various malingering scores 
are simply used to determine whether the regular test scores should be
accepted or rejected. In other cases a numerical correction can be applied to
the regular scores, on the basis of the subject's score on the malingering key. 


PROBLEMS OF TEST ADMINISTRATION 


A whole volume could easily be devoted to a discussion of desirable
procedures of test administration. But such a survey falls outside the scope
of the present book. Moreover, it is more practicable to acquire such
techniques within specific settings, since no one individual would normally
be concerned with all forms of testing, from the examination of infants to
the clinical testing of psychotic patients or the administration of a mass
testing program for military personnel. The present discussion will therefore
deal principally with the common rationale of test administration rather than
with specific questions of implementation.⁵

4 A fuller discussion of this and other techniques employed in specific
personality inventories will be found in Chapter 18.

Advance Preparation of Examiners. The most important requirement for 
good testing procedure is advance preparation. In testing there can be no
emergencies. Special efforts must therefore be made to foresee and forestall 
emergencies. Only in this way can uniformity of procedure be assured. 

Advance preparation for the testing session takes many forms. Memorizing 
the verbal instructions is essential in most individual testing. Even in a group 
test in which the instructions are read to the subjects, some previous
familiarity with the statements to be read prevents misreading and
hesitation, and permits a more natural, informal manner during test
administration. The preparation of test materials is another important
preliminary step. In individual testing, and especially in the administration
of performance tests, such preparation involves the actual layout of the
necessary materials to facilitate subsequent use with a minimum of search or
fumbling. Materials should generally be placed on a table near the testing
table, so that they are within easy reach of the examiner but do not distract
the subject. When apparatus is employed, frequent periodic checking and
calibration may be necessary. In group testing, all test blanks, answer
sheets, special pencils, or other materials needed should be carefully
counted, checked, and arranged in advance of the testing day.


Rehearsal of procedure with one or more trial subjects is another essential 
prerequisite in both individual and group testing. In the case of group
testing, and especially in large-scale projects, such preparation may include
the advance briefing of examiners and proctors, so that each is thoroughly
familiar with the functions he is to perform. In general, the examiner reads
the instructions, takes care of timing, and is in charge of the group in any
one testing room. The proctors hand out and collect test materials, make
certain that subjects are following instructions, answer individual questions
of subjects within the limitations specified in the manual, and prevent
cheating.

Some attention should also be given to the selection of a suitable testing 
room. Such a room should be free from undue noise and distraction, and 


5 For detailed suggestions regarding testing procedure, the reader is
referred to Goodenough (28, Ch. 20) and Watson (78, Ch. 12) for the testing
of preschool children; Terman and Merrill (67, pp. 45-59) for individual
testing of older children and adults; and Lindquist (47, Ch. 10), Thorndike
(69, Ch. 9), and Ligon (46) for group testing.
should provide adequate lighting, ventilation, seating facilities, and working
space for the subjects. Special steps should also be taken to prevent
interruptions during the test. Posting a sign on the door to indicate that
testing is in progress is effective, provided all personnel have learned that
such a sign means no admittance under any circumstances. In the testing of
large groups, locking the doors or posting an assistant outside each door may
be necessary to prevent the entrance of late-comers.

Testing Conditions. It is important to realize the extent to which testing
conditions may influence scores. Even apparently minor aspects of the testing
situation may appreciably alter the subjects' performances. Such a factor as
the use of desks or of chairs with desk arms, for example, proved to be
significant in a group testing project with high school students, the groups
using desks tending to obtain higher scores (44, 73). Similarly, whether the
examiner is a stranger or someone familiar to the subjects has been found
to make a significant difference in test scores (61, 74). In another study, the
general manner and behavior of the examiner, as illustrated by smiling,
nodding, and making such comments as "Good" or "Fine," were shown to
have a decided effect upon test results (82). In a projective test requiring
the subject to write stories to fit given pictures, the presence of the examiner
in the room tended to inhibit the inclusion of strongly emotional content in
the stories (9).

Previous Activities of Subjects. The subjects' activities immediately
preceding the test may affect their performance, especially when such activities
produce emotional disturbance, fatigue, or other handicapping conditions.
Thus in an investigation with third- and fourth-grade school children, there
was some evidence to suggest that IQ on a non-verbal intelligence test was
influenced by the children's preceding classroom activity (48). On one
occasion, the class had been engaged in writing a composition on "The Best Thing
That Ever Happened to Me"; on the second occasion, they had again been
writing, but this time on "The Worst Thing That Ever Happened to Me."
The IQ's on the second test, following what may have been an emotionally
depressing experience, averaged 4 or 5 points lower than on the first test.

These findings were corroborated in a study specifically designed to
determine the effect of immediately preceding experience upon test performance
(39). The same test was employed as in the earlier study. In the later
investigation, children who had had a gratifying experience involving the
successful solution of an interesting puzzle, followed by a reward of toys and
candy, showed more improvement in their test scores than those who had
undergone neutral or less gratifying experiences.

Response Sets. Test performance may also be affected by the response sets
with which subjects approach the test (cf. 17). Owing to the nature of the
instructions and to the form in which test items are expressed, the subject
may be set to respond in a particular way. In such cases, changing the item
form or modifying the instructions would alter the subject's responses. The
response set thus represents a condition specific to the particular test and may
be quite irrelevant to the aptitude being measured. Moreover, when test
instructions are ambiguous or fail to specify certain aspects of procedure,
response sets may differ from person to person, thereby introducing another
source of variation in testing conditions.


One illustration of response sets is to be found in the tendency to guess,
or take a chance when not sure of the answer, rather than omit the item. If
no instructions regarding guessing are provided, the more "reckless" subjects
may guess on every uncertain item, leaving no omissions. Other, more
cautious individuals may only mark those answers of which they feel very
confident. This difference in response set probably reflects personality
characteristics rather than aptitude, and would therefore serve only to reduce
the validity of aptitude scores. In general, the subject who guesses will have
a certain advantage, since at least some of his guesses will be correct by
chance. Except in purely blind guessing, individuals will tend to be more
often right than wrong in their guesses, since they usually have some
knowledge about the items on which they guess; for this reason common correction
formulas for guessing, such as "number right minus number wrong" for
true-false tests, tend to undercorrect. On the other hand, item writers usually
try to make the incorrect alternatives sound so plausible that every examinee
who guesses will choose a wrong answer. In so far as this goal is achieved,
the usual correction formulas would overcorrect.

The correction for guessing will, of course, be smaller on multiple-choice
than on true-false tests, and the greater the number of alternative responses
the smaller will be the correction.6 Thus the probability of guessing correctly
on a true-false item would be one out of two; while on a five-alternative
multiple-choice item, it would be one out of five. It should also be noted
that if all subjects are instructed to omit no items and to guess when not
sure, no correction for guessing is needed. In such a case, each subject's
relative position in the group would be identical whether or not any
correction is applied. Such a procedure also provides a uniform response set, thus
eliminating the effects of individual differences in willingness to take a
chance. For these reasons, many psychological tests specify that no omissions
are to be made.

6 The general correction formula for guessing is:

$\text{Corrected score} = R - \dfrac{W}{n - 1}$

in which R is the number of items right, W the number of items wrong, and n
is the number of alternative responses per item. In a true-false test, in which
there are only two alternative answers per item, the corrected score reduces
to $R - W$.
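
The correction formula for guessing, R - W/(n - 1), can be illustrated with a short sketch. The function name, the sample scores, and the 40-item test below are hypothetical and chosen only for illustration; exact fractions are used so that no rounding convention has to be assumed.

```python
from fractions import Fraction


def corrected_score(right: int, wrong: int, n_alternatives: int) -> Fraction:
    """Corrected score = R - W/(n - 1). Omitted items count neither way."""
    if n_alternatives < 2:
        raise ValueError("an item needs at least two alternative responses")
    return Fraction(right) - Fraction(wrong, n_alternatives - 1)


# True-false test (n = 2): the formula reduces to R - W.
true_false = corrected_score(right=30, wrong=10, n_alternatives=2)

# Five-alternative items: each wrong answer costs only one quarter of a point.
five_choice = corrected_score(right=30, wrong=10, n_alternatives=5)

# If nobody omits an item, W = total - R, so the corrected score is an
# increasing linear function of R and the rank order cannot change.
total_items = 40                          # hypothetical test length
raw = {"A": 35, "B": 28, "C": 22}         # hypothetical raw scores
corrected = {s: corrected_score(r, total_items - r, 5) for s, r in raw.items()}
rank_raw = sorted(raw, key=raw.get, reverse=True)
rank_corrected = sorted(corrected, key=corrected.get, reverse=True)
```

With these figures the same 30 right and 10 wrong yield a larger deduction on the true-false test than on the five-alternative test, and the three hypothetical subjects keep the same order before and after correction.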

The question of how to handle guessing on psychological tests is still a 
controversial one. Should subjects be instructed to guess or not to guess? 
Should a correction for guessing be applied in scoring? Arguments can be 
found in support of different practices, the most desirable solution probably
varying with the situation. The test manual, however, should recognize these
problems and indicate how they ought to be handled.

A similar question pertains to the relative emphasis to be put upon speed
and accuracy. On certain clerical aptitude tests, for example, a very careful
worker may be handicapped because he proceeds too slowly in order to
avoid errors. The individual who races through the test, on the other hand,
may complete twice as many items and make only a few errors. His over-all
score would therefore be much higher than that of the slow, cautious
worker. Before beginning a test, subjects should be given a clear picture of
the relative importance of speed and accuracy in determining their scores.

Other, more specific examples of response sets include the tendency to 
answer “true” rather than "false" when not sure about a true-false item, 
and the tendency to check many or few responses when the subject is free 
to check as many responses as he chooses. The latter response set could
occur, for example, if the subject is required to check all the statements
that follow from a certain premise in a logical reasoning test. 

In general, the problem of response sets is the concern of the test
constructor. The test user, however, should take response sets into account in
choosing tests. Other things being equal, a test that reduces individual
differences in response sets is to be preferred. Multiple-choice items requiring a
single response are the most satisfactory in this regard. The test instructions,
moreover, should be checked for the clarity with which such matters as
guessing and the relative importance of speed and accuracy are treated.
Finally, in the process of administering a test, the examiner should be on
his guard against the introduction of any response sets not covered by the
standardized instructions. This caution is especially pertinent to the
answering of questions raised by subjects and to preliminary announcements
regarding the purpose of the test.

Clinical Interaction. A comprehensive analysis of the operation of examiner
and situational variables in test performance has been provided by Sarason
(62) under the general concept of "clinical interaction." By reference to
published studies as well as to his own research and case reports, Sarason
demonstrates that test scores are susceptible to the interaction of individual
subject characteristics with examiner and situational variables. Among the
significant examiner variables are included age, sex, race, professional and
socioeconomic status, appearance, and such behavioral characteristics as
self-confidence, aggressiveness, responsiveness, and social warmth. Situational
variables are illustrated by the place where the test is administered (school,
clinic, hospital, jail, psychology laboratory, etc.); the expectations and
attitudes built up by the way in which the forthcoming test experience was
presented to the subject; and the nature of specific instructions. That the effect
of all such conditions can be best conceptualized in terms of interaction is
exemplified by the finding that such factors as ego-involving instructions and
appearance of the examiner have a different influence on subjects with different
personality characteristics. A more recent survey of studies on the effects of
interpersonal and situational variables on test performance was prepared by
Masling (53).

Although most of the studies reported by Sarason and by Masling concern
individual clinical examinations with projective techniques, all testing will
to some extent be subject to social and situational interaction. There is
evidence, for example, that the emotional interaction of examiner and
subjects influences the results obtained with individual intelligence tests (52).
It is not to be inferred, however, that we should abandon the practice of
administering tests under standardized conditions. Rather the examiner
should be all the more alert to control those conditions that can be kept
uniform, so as to retain the applicability of norms. At the same time, any
aspects of the particular test situation that cannot be controlled should be
clearly recognized and taken into account in interpreting the subject's
responses on the test. Ignoring inevitable sources of variation is no protection
against their effects.


PROBLEMS OF SCORING 


The principal considerations in the selection and application of scoring
procedures are accuracy, speed, and economy. The last two are especially
important in large-scale testing programs. Efficient operation of the scoring
program, with a minimum of wasted effort, helps to reduce cost. An even
more important factor in determining the cost of scoring is the selection of
tests that provide appropriate scoring techniques. Speed of scoring is a major
consideration in many testing programs in which results must be made
available promptly, as in testing military personnel and college or
professional-school applicants. Accuracy is of course an essential requirement of all types
of scoring, whether in individual or group testing. Research on the
development and improvement of scoring techniques is directed toward this triple
objective of accuracy, speed, and economy, although the relative emphasis
given to the three aspects varies with the nature of the test and with the use
to which it is put.

Individual Tests. Whether of the verbal or performance type, individual
tests are generally scored by the examiner. While administering the test, the
examiner records the subject's responses on a printed record form. Some
tests, such as the Stanford-Binet, need to be scored during the process of
test administration, since the presentation of further items depends upon the
subject's performance on prior items; other individual tests may be scored
later. In so far as subjective judgment enters into the evaluation of any
response, the results obtained by independent scorers should be checked on a
representative sample of record sheets. This check provides a measure of one
aspect of the reliability of the test, and will be discussed further in Chapter
5. A different type of checking is the routine repetition of certain steps in
the scoring to correct any clerical or computational errors. Even such a
simple operation as the computation of an IQ reveals such common errors
that checking is imperative. A vivid demonstration of this fact can easily be
arranged by having any class compute an IQ from a given mental age and a
given date of birth. The range of IQ's reported by different students should
prove impressive.
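
The classroom demonstration mentioned above can be sketched in a few lines. The sketch assumes the classical ratio IQ (100 times mental age divided by chronological age), chronological age counted in completed months, and rounding to the nearest whole number; these conventions, the function names, and the dates are illustrative assumptions, and any actual test manual prescribes its own rules.

```python
from datetime import date


def age_in_months(born: date, tested: date) -> int:
    """Chronological age in completed months on the testing date."""
    months = (tested.year - born.year) * 12 + (tested.month - born.month)
    if tested.day < born.day:      # the current month is not yet completed
        months -= 1
    return months


def ratio_iq(mental_age_months: int, born: date, tested: date) -> int:
    """Ratio IQ = 100 * MA / CA, rounded to the nearest whole number."""
    return round(100 * mental_age_months / age_in_months(born, tested))


born = date(1950, 3, 15)           # hypothetical date of birth
mental_age = 11 * 12 + 2           # hypothetical mental age: 11 years 2 months

iq_late_may = ratio_iq(mental_age, born, date(1960, 5, 20))
iq_early_may = ratio_iq(mental_age, born, date(1960, 5, 10))
```

Testing ten days earlier changes the completed-month count, and therefore the IQ, which is one of the small clerical decisions on which independent computers are likely to disagree.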

Group Tests. Since speed and economy are major considerations in group
testing, such tests do not as a rule need to be scored by a trained examiner,
but usually permit either hand scoring by a clerk or machine scoring.7 The
latter always requires a specially printed answer sheet. But manual scoring
can also be done more efficiently through the use of a separate answer
sheet on which the subjects record their responses. Such an arrangement not
only provides greater compactness of records and saves scoring labor, but it
also permits the repeated use of the same test booklets by different subjects.


One of the most common hand-scoring procedures utilizes the fan or
accordion key. The correct answers for each page of the test booklet are
printed in a separate column. When the key is folded vertically, in accordion
fashion, only one column is visible at a time. With such a key, scoring
involves the comparison of the subject's answers with the corresponding
answers given on the key. This type of scoring requires more time than many
other available techniques. It is most often used with tests in which the
subject writes in each response, instead of selecting and marking the correct
response to each item.

An efficient type of scoring key for hand scoring is the punched or cut-out
scoring stencil. Such a key is applicable to multiple-choice items in which
each alternative response occupies a different position on the answer sheet.
Windows are cut on the scoring stencil to correspond to the positions of
the correct alternatives. Superimposing such a stencil over the answer
sheet immediately reveals all items that have been correctly marked
by the subject. A few items from a test that utilizes this type of scoring are
shown in Figure 7. In this test, the subjects are instructed to mark the word
in each line that means the opposite of the first word.

7 For a comprehensive and critical survey of a large number of available scoring
techniques, cf. Lindquist (47, pp. 365-413).


saucy:    rude       [illegible]   polite
agile:    clumsy     stealthy      nimble
son:      daughter   mother        father
frolic:   pain       work          gastric

Fig. 7. Sample Items from a Test Booklet Scored with a Punched Scoring Stencil.
(From Pintner-Durost Elementary Test; copyright by World Book Company.)
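
The logic of a punched stencil can be sketched as a lookup of window positions. In the sketch below the answer positions are numbered 1 to 3 from left to right, and the key is only an illustrative reading of the four sample items, not the publisher's key; the function name and the marked answers are likewise hypothetical.

```python
# Window position (1-3) cut in the stencil for each item: an assumed key.
stencil = {"saucy": 3, "agile": 1, "son": 1, "frolic": 2}


def tally(marks: dict, key: dict) -> tuple:
    """Return (rights, wrongs, omits) for one answer sheet.

    marks maps an item to the position the subject marked; an item that is
    missing from marks, or mapped to None, counts as an omission.
    """
    rights = sum(1 for item, window in key.items() if marks.get(item) == window)
    omits = sum(1 for item in key if marks.get(item) is None)
    wrongs = len(key) - rights - omits
    return rights, wrongs, omits


# Hypothetical subject: two correct marks, one wrong mark, "frolic" omitted.
subject_marks = {"saucy": 3, "agile": 1, "son": 2}
rights, wrongs, omits = tally(subject_marks, stencil)
```

Counting wrongs and omits separately matters whenever a correction for guessing is to be applied, since only wrong answers enter the correction.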


A number of tests utilize self-scoring answer sheets. An example is the
self-scoring carbon pad provided with the tests of Primary Mental Abilities
(see Ch. 13). In these tests, the subject marks his answers on the outside
of a two-page answer booklet which is folded together and cannot be opened
without tearing. Squares corresponding to the position of the correct answers
are printed on the inside of the booklet, where the subject's marks are
automatically recorded through the carbon backing of the outer sheet.
The subject, of course, cannot see the inside of the
booklet without tearing the booklet apart, which would invalidate his test.
To score the test, it is only necessary to tear the booklet open and count
the number of X's that fall within the printed squares. A section of this
answer sheet is reproduced in Figure 8.

Another self-marking device is the pin-punch answer pad used, for
instance, with the Kuder Preference Record, a measure of vocational interests
(see Ch. 19). In this case, the subject is provided with a metal pin, with
which he punches holes in the appropriate circles on the answer sheet to
indicate his preferences (see Fig. 112, p. 537). On the inside of the answer
booklet are printed sets of circles corresponding to the responses to be
scored within each interest area. The subject's score in each of these areas is
found by counting the number of holes punched within the appropriate set
of circles.


Fig. 8. A Portion of the Self-Scoring Carbon Answer Pad Used with the SRA Tests
of Primary Mental Abilities for Ages 11 to 17. The illustration shows the inner or
"scorer's eye" view, never seen by the subject. (Reproduced by permission of
Science Research Associates.)


Parenthetically, it may be added that another type of so-called self-scorer
is designed, not for the convenience of the examiner, but for the instruction
of the subject. In such a test, for example, the subject keeps on choosing
answers in a multiple-choice item until the correct one is found. The scoring
device may be a punchboard, in which a full hole can be punched only
through correct answers, since these correspond to apertures in the perforated
stencil inserted under the answer sheet. Incorrect answers produce only a
small pin prick. Or, the holes for the wrong answers may show red on the
backing sheet, while the correct one appears black. The subject's score on
such a test can be determined by counting the number of holes punched;
the fewer the holes or attempts he requires to reach the correct answers,
the better his score. The self-scorer principle underlies many of the so-called
teaching machines currently in use (18, 22, 38, 58). Further reference to
these machines will be made in Chapter 16.

For large-scale testing programs, machine scoring is the most efficient
procedure. The IBM Test-Scoring Machine, developed by the International
Business Machines Corporation, is probably the best-known device for this
purpose. Use of this machine requires specially prepared, patented answer
sheets, on which the subject records his responses by blackening the space
between two lines. Part of such an answer sheet is shown in Figure 9.
Subjects are also provided with special soft pencils for marking the answers.

Fig. 9. Part of an Answer Sheet for Use with the IBM Test-Scoring Machine.
(Reproduced by permission of International Business Machines Corporation.)

The test-scoring machine operates essentially in terms of electrical contacts.
The graphite pencil-marks on the answer sheet establish contact across a set
of contact brushes in the machine. These electrical contacts are counted by
the machine and the total thus obtained represents the subject's score.
When a scoring stencil with holes cut over the correct responses is inserted in
the machine, contacts are made only with the marks in the correct spaces.
Thus the number of correct responses can be read from the machine. Errors
and omissions can also be scored by inserting another, appropriately cut
scoring stencil.

A special precaution to be observed with machine-scored answer sheets is
the inspection of the answer sheets prior to scoring. If subjects have not
followed directions carefully, some of their pencil-marks may not be heavy
enough to establish proper contact in the machine. Similarly, stray,
unintentional pencil-marks may be counted as wrong responses by the machine.
Investigation has shown that such carelessly marked answer sheets may
significantly lower the mean score of a group; in individual cases, the scores
may be lowered by more than 25 per cent (72). Much of this difficulty, of
course, can be avoided by proper instructions and demonstration during the
administration of the test, as well as by adequate proctoring. Nevertheless,
all answer sheets should be inspected before machine scoring.

Systematic procedures for checking the actual scoring represent an
essential step in all large-scale testing programs. Such checking is designed to
detect both chance and constant errors. The first type of error generally
results from incorrect reading or transcribing of scores, or from other routine
clerical inaccuracies. Such errors can be checked by the independent
repetition of those scoring operations subject to clerical errors, for all papers.
Constant errors may result from procedural mistakes, such as the use of the
wrong scoring key. They can be caught by spot checking all operations,
through the complete and independent rescoring of a sample of papers.
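
Spot checking by independent rescoring of a sample can be sketched as follows. The papers, keys, sample size, and function names are hypothetical; the example simulates a constant error by recording every score with a wrong key and then rescoring a random sample with the right one.

```python
import random


def score(answers: list, key: list) -> int:
    """Number of answers that agree with the key."""
    return sum(a == k for a, k in zip(answers, key))


def spot_check(papers: dict, recorded: dict, key: list,
               sample_size: int, seed: int = 0) -> dict:
    """Rescore a random sample of papers; return those whose scores disagree."""
    rng = random.Random(seed)
    sample = rng.sample(sorted(papers), sample_size)
    rescored = {pid: score(papers[pid], key) for pid in sample}
    return {pid: (recorded[pid], new)
            for pid, new in rescored.items() if recorded[pid] != new}


correct_key = [1, 2, 3, 1]
wrong_key = [2, 2, 3, 1]            # procedural mistake: wrong key used
papers = {"P1": [1, 2, 3, 1], "P2": [2, 2, 3, 1], "P3": [1, 1, 3, 1]}
recorded = {pid: score(ans, wrong_key) for pid, ans in papers.items()}

disagreements = spot_check(papers, recorded, correct_key, sample_size=2)
```

Because a wrong key affects every paper in the same way, even a small sample exposes it; chance clerical slips, by contrast, require repetition of the operation on all papers.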

Scorer Bias. A subtle kind of constant error that may affect the scoring of
certain types of individual as well as group tests results from the mental set
or "unconscious bias" of the scorer. An example will suggest how such errors
may operate. In a study of the errors made by a school teacher in grading
spelling papers, the number and direction of the scoring errors were found
to be significantly related to the ratings for "personal attractiveness" given
by the teacher to the same children (cf. 27, p. 376). Although the teacher
had tried to score the papers accurately, the errors she made tended to raise
the scores of children whom she found personally more attractive, and to
lower the scores of children whom she considered less attractive.

Such constant errors resulting from personal bias could also operate in
reference to a whole group. For example, in the comparison of ethnic groups,
socioeconomic levels, or urban and rural samples, the scorer's own
expectation of superiority or inferiority in certain groups could influence his scoring
errors. Similarly, in an experimental investigation, the experimenter's
hypothesis might lead him to expect a certain group difference and might thus
bias his scoring. It should be noted that such constant errors occur despite
the scorer's efforts to be accurate and objective. The surest way to
prevent such errors is to keep the scorer, and if possible the examiner,
in ignorance of the subject's group membership. This can be accomplished
by removing names and other identifying marks and by coding the test
papers prior to scoring. Such precautions are especially important in the case
of tests whose scoring involves a certain degree of subjectivity, such as
projective techniques and many individual intelligence scales.
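
The coding of papers prior to scoring can be sketched as below. The record layout, code format, and function name are hypothetical; the point is only that the scorer receives the responses under an arbitrary code, while the list linking codes to names and groups is held back until scoring is complete.

```python
import random


def code_papers(papers: list, seed=None) -> tuple:
    """Split identified papers into blind copies and a separate lookup list.

    papers: list of dicts with the keys 'name', 'group', and 'answers'.
    Returns (blind, lookup); blind holds only a code and the answers.
    """
    rng = random.Random(seed)
    codes = [f"P{i:03d}" for i in range(1, len(papers) + 1)]
    rng.shuffle(codes)             # codes must not reveal the original order
    blind = [{"code": c, "answers": p["answers"]} for c, p in zip(codes, papers)]
    lookup = {c: (p["name"], p["group"]) for c, p in zip(codes, papers)}
    return blind, lookup


papers = [
    {"name": "Subject 1", "group": "urban", "answers": [1, 3, 2]},
    {"name": "Subject 2", "group": "rural", "answers": [2, 3, 2]},
    {"name": "Subject 3", "group": "urban", "answers": [1, 1, 2]},
]
blind, lookup = code_papers(papers, seed=1)
```

Scores are entered against the codes and matched to names and groups only afterward, by someone other than the scorer where possible.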
CHAPTER 4


Norms: Their Nature and Interpretation 


It will be recalled from Chapter 2 that a “raw score" on any psychological 
test is, in itself, quite meaningless. Obviously, to say that an individual has 
correctly solved 15 problems on an arithmetic reasoning test, or identified 
34 words in a vocabulary test, or successfully completed a mechanical puzzle 
in 57 seconds conveys little or no information about his standing in any of
these functions.

Nor do the familiar "percentage scores" provide a satisfactory solution to
the problem of interpreting test scores. A score of 65 per cent correct on one
vocabulary test, for example, might be equivalent to 30 per cent correct on
another, and to 80 per cent correct on a third. The difficulty level of the
items making up each test will, of course, determine the meaning of the
score. In a test in which the items are answered correctly, on the average,
by 90 per cent of the subjects, the average score of the group will be 90
per cent of a perfect score. Such a test would be very unsatisfactory for
differentiating among superior individuals, since the upper half of the group
would be bunched between scores of 90 per cent and 100 per cent. On the
other hand, a test in which 50 per cent of the subjects, on the average,
answer each item correctly would yield an average score of 50 per cent
right, and would permit finer discrimination among individuals throughout
the range. A score of 50 per cent on the latter test would correspond to a
score of 90 per cent on the former. Like all raw scores, such percentage
scores can be interpreted only by reference to norms.

Essentially, psychological test norms represent the test performance of the
standardization sample. The norms are thus empirically established by
determining what a representative group of persons actually do on the test.
Any individual's raw score is then referred to the distribution of scores
obtained by the standardization sample, to discover where he falls in such a
distribution. Does his score coincide with the average performance of the
standardization group? Is he slightly below average? Or does he fall near
the upper end of the distribution?

In order to determine more precisely the individual's exact position with
reference to the standardization sample, the raw score is converted into some
relative measure. Such converted scores are designed to serve a dual purpose.
First, they indicate the individual's relative standing in the normative sample
and thus permit an evaluation of his performance in reference to other
persons. Secondly, they provide comparable measures which make possible a
direct comparison of the individual's performance on different tests. If, for
example, we find that a given individual has a raw score of 40 on a
vocabulary test and 22 on an arithmetic reasoning test, we obviously know nothing
about his relative performance on the two tests. Is he better in vocabulary or
in arithmetic, or equally good in both? Since raw scores on different tests are
usually expressed in different units, a direct comparison of such scores is
impossible. The difficulty level of the particular test would also affect such a
comparison between raw scores. Converted scores, on the other hand, can be
expressed in the same units and referred to the same or to closely similar
normative samples for different tests. The individual's relative performance
in many different functions can thus be compared.

There are various ways in which raw scores may be converted to fulfill the
two objectives stated above. Fundamentally, however, test scores are of three
major types: age scores, percentiles, and standard scores. These types,
together with some of their common variants, will be considered in separate
sections of this chapter. But first it will be necessary to examine a few
elementary statistical concepts that underlie the development and utilization of
norms. The following section is included simply to clarify the meaning of
certain common statistical measures. Simplified computational examples are
given only for this purpose and not to provide training in statistical methods.
For computational details and specific procedures to be followed in the
practical application of these techniques, the reader is referred to any
textbook on psychological or educational statistics, such as Blommers and
Lindquist (6), Garrett (12), Guilford (14), or Walker and Lev (27). Shorter
and more elementary introductions to statistical method have also been
published by Garrett (11) and by Walker and Lev (28).


SOME ELEMENTARY STATISTICAL CONCEPTS


A major object of statistical method is to organize and summarize
quantitative data in order to facilitate their understanding. A list of 1000 test
scores can be an overwhelming sight. In that form, it conveys little meaning.
A first step in bringing order into such a chaos of raw data is to tabulate the
scores into a frequency distribution, as illustrated in Table 1. Such a
distribution is prepared by grouping the scores into convenient class intervals and
tallying each score in the appropriate interval. When all scores have been
entered, the tallies are counted to find the frequency, or number of cases, in
each class interval. The sum of these frequencies will equal N, the total
number of cases in the group. Table 1 shows the scores of 1000 college students
in a code-learning test in which one set of artificial "words," or nonsense
syllables, was to be substituted for another. The raw scores, giving number
of correct syllables substituted during a two-minute trial, ranged from 8 to
52. They have been grouped into class intervals of 4 points, from 52-55
at the top of the distribution down to 8-11. The frequency column reveals that
2 persons scored between 8 and 11, 3 between 12 and 15, 8 between 16 and
19, and so on.


TABLE 1. Frequency Distribution of Scores of 1000 College Students on a
Code-Learning Test
(From Anastasi, 1, p. 34)

Class Interval     Frequency

    52-55               1
    48-51               1
    44-47              20
    40-43              73
    36-39             156
    32-35             328
    28-31             244
    24-27             136
    20-23              28
    16-19               8
    12-15               3
     8-11               2
                 N = 1000
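
The tallying procedure can be sketched with a handful of scores. The eleven raw scores below are invented for illustration (the 1000 original scores are not reproduced in the text); only the interval width of 4 and the lowest limit of 8 follow Table 1, and the function name is hypothetical.

```python
from collections import Counter


def frequency_distribution(scores: list, lowest: int, width: int) -> list:
    """Tally scores into class intervals, listed from the top interval down.

    Assumes every score is at least `lowest`. Each entry is
    ((lower limit, upper limit), frequency).
    """
    tallies = Counter((s - lowest) // width for s in scores)
    top = max(tallies)
    return [((lowest + i * width, lowest + i * width + width - 1),
             tallies.get(i, 0))
            for i in range(top, -1, -1)]


scores = [8, 13, 17, 18, 25, 26, 30, 33, 34, 35, 52]   # hypothetical data
distribution = frequency_distribution(scores, lowest=8, width=4)
n_cases = sum(freq for _, freq in distribution)
```

Intervals in which no score falls are kept with a frequency of zero, so that the class intervals remain continuous from the lowest to the highest score.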


The information provided by a frequency distribution can also be
presented graphically in the form of a distribution curve. Figure 10 shows the
data of Table 1 in graphic form. On the baseline, or horizontal axis, are the
scores, grouped into class intervals; on the vertical axis are the frequencies, or
number of cases falling within each class interval. The graph has been
plotted in two ways, both forms being in common use. In the histogram, the
height of the column erected over each class interval corresponds to the
number of persons scoring in that interval. We can think of each individual
standing on another's shoulders to form the column. In the frequency
polygon, the number of persons in each interval is indicated by a point placed in
the center of the class interval and across from the appropriate frequency.
The successive points are then joined by straight lines.


[Figure: number of cases (vertical axis) plotted against scores grouped into
class intervals (horizontal axis), drawn both as a frequency polygon and as a
histogram.]

Fig. 10. Distribution Curves: Frequency Polygon and Histogram. (Data from
Table 1.)
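
The two graphic forms can be approximated in plain text from the frequencies of Table 1. The sketch below prints a sideways histogram, one symbol for roughly every ten cases (an arbitrary scale chosen here), and lists the points of the frequency polygon, each placed at the center of its class interval. The variable and function names are illustrative.

```python
# Class intervals and frequencies as given in Table 1.
table_1 = [
    ((52, 55), 1), ((48, 51), 1), ((44, 47), 20), ((40, 43), 73),
    ((36, 39), 156), ((32, 35), 328), ((28, 31), 244), ((24, 27), 136),
    ((20, 23), 28), ((16, 19), 8), ((12, 15), 3), ((8, 11), 2),
]


def bar(frequency: int, cases_per_symbol: int = 10) -> str:
    """One '#' for about every `cases_per_symbol` cases."""
    return "#" * round(frequency / cases_per_symbol)


# Histogram: one row per class interval, lowest interval first.
for (low, high), frequency in reversed(table_1):
    print(f"{low:2d}-{high:2d} | {bar(frequency)} {frequency}")

# Frequency polygon: (center of interval, frequency), to be joined by lines.
polygon_points = [((low + high) / 2, frequency)
                  for (low, high), frequency in table_1]
```

The tallest bar and the highest polygon point both fall on the 32-35 interval, whose center is 33.5.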


Except for minor irregularities, the distribution portrayed in Figure 10
resembles the bell-shaped normal curve. A mathematically determined, perfect
normal curve is reproduced in Figure 12. This type of curve has important
mathematical properties and provides the basis for many kinds of statistical
analyses. For the present purpose, however, only a few features will be
noted. Essentially, the curve indicates that the largest number of cases cluster
in the center of the range, and the number drops off gradually in both
directions as the extremes are approached. The curve is bilaterally symmetrical,
with a single peak in the center. Most distributions of human traits, from
height and weight to aptitudes and personality characteristics, approximate
the normal curve. In general, the larger the group, the more closely will the
distribution resemble the theoretical normal curve.

A group of scores can also be described in terms of some measure of
central tendency. Such a measure provides a single, most typical or
representative score to characterize the performance of the entire group. The most
familiar of these measures is the average, more technically known as the
mean (M). As is well known, this is found by adding all scores and dividing
the sum by the number of cases (N). Another measure of central tendency
is the mode, or most frequent score. In a frequency distribution, the mode is
the midpoint of the class interval with the highest frequency. Thus in Table
1, the mode falls midway between 32 and 35, being 33.5. It will be noted
that this score corresponds to the highest point on the distribution curve in
Figure 10. A third measure of central tendency is the median, or middlemost
score when all scores have been arranged in order of size. The median is
the point that bisects the distribution, half the cases falling above it and half
below.

Further description of a set of test scores is given by measures of vari- 
ability, or the extent of individual differences around the central tendency. 
The most obvious and familiar way of reporting variability is in terms of the 
range between the highest and lowest score. The range, however, is extremely 
crude and unstable, since it is determined by only two scores. A single unusu- 
ally high or low score would thus markedly affect its size. A more precise 
method of measuring variability is based on the difference between each 
individual's score and the mean of the group. 

At this point it will be helpful to look at the example in Table 2, in which
the various measures under consideration have been computed on 10 cases.
Such a small group was chosen in order to simplify the demonstration, although
in actual practice we would rarely perform these computations on so few
cases. Table 2 serves also to introduce certain standard statistical symbols
which should be noted for future reference. Original raw scores are
conventionally designated by a capital X, and a small x is used to refer to
deviations of each score from the group mean. The Greek letter Σ means "sum
of." It will be seen that the first column in Table 2 gives the data for the
computation of mean and median. The mean is 40; the median is 40.5, falling
midway between 40 and 41: five cases (50 per cent) are above the median and
five below. There is little point in finding a mode in such a small group,
since the cases do not show clear-cut clustering on any one score.
Technically, however, 41 would represent the mode, since two persons obtained
this score, while all other scores occur only once.


TABLE 2. Illustration of Central Tendency and Variability

                         Score      Diff.      Diff. Squared
                          (X)        (x)           (x²)
                           48        +8             64
                           47        +7             49
   50% of cases            43        +3              9
                           41        +1              1
                           41        +1              1
   Median = 40.5  ------------------------------------------
                           40         0              0
                           38        -2              4
   50% of cases            36        -4             16
                           34        -6             36
                           32        -8             64
                     ΣX = 400    Σ|x| = 40      Σx² = 244

   $M = \frac{\Sigma X}{N} = \frac{400}{10} = 40$
   $AD = \frac{\Sigma |x|}{N} = \frac{40}{10} = 4.0$
   $\text{Variance} = \sigma^2 = \frac{\Sigma x^2}{N} = \frac{244}{10} = 24.40$
   $SD \text{ or } \sigma = \sqrt{\frac{\Sigma x^2}{N}} = \sqrt{24.40} = 4.9$
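
As a minimal sketch of the three measures of central tendency, the following uses Python's standard `statistics` module. The ten scores are those consistent with the totals reported for Table 2 (mean 40, median 40.5, mode 41); the variable names are illustrative only.

```python
from statistics import mean, median, mode

# The ten scores of Table 2, highest to lowest.
scores = [48, 47, 43, 41, 41, 40, 38, 36, 34, 32]

m = mean(scores)      # sum of all scores divided by the number of cases (N)
mdn = median(scores)  # midpoint between the 5th and 6th scores in order of size
mo = mode(scores)     # the most frequent score

print(m, mdn, mo)
```

With an even number of cases the median falls midway between the two middle scores, which is why it is 40.5 rather than an obtained score.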

The second column shows how far each score deviates above or below the
mean of 40. The sum of these deviations will always equal zero, since the
positive and negative deviations around the mean necessarily balance, or
cancel each other out (+20 − 20 = 0). If we ignore signs, of course, we can
average the absolute deviations, thus obtaining a measure known as the
average deviation (AD). The symbol |x| in the AD formula indicates that
absolute values were summed, without regard to sign. Although of some
descriptive value, the AD is not suitable for use in further mathematical
analyses, because of the arbitrary discarding of signs.

A much more serviceable measure of variability is the standard deviation
(symbolized by either SD or σ), in which the negative signs are legitimately
eliminated by squaring each deviation. This procedure has been followed in
the last column of Table 2. The sum of this column divided by the number of
cases (Σx²/N) is known as the variance, or mean square deviation, and is
symbolized by σ². The variance has proved extremely useful in sorting out the
contributions of different factors to individual differences in test
performance. For the present purposes, however, our chief concern is with the
SD, which is the square root of the variance, as shown in Table 2. This
measure is commonly employed in comparing the variability of different groups.
In Figure 11, for example, are two distributions having the same mean but
differing in variability. The distribution with wider individual differences
yields a larger SD than the one with narrower individual differences.
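
The measures of variability just described can be sketched in a few lines of Python. The scores are those consistent with the totals of Table 2; note that the variance here divides by N, as in the text, not by N − 1.

```python
from math import sqrt

scores = [48, 47, 43, 41, 41, 40, 38, 36, 34, 32]
n = len(scores)
m = sum(scores) / n

deviations = [x - m for x in scores]           # x = X - M for each case
ad = sum(abs(d) for d in deviations) / n       # average deviation (signs ignored)
variance = sum(d * d for d in deviations) / n  # mean square deviation
sd = sqrt(variance)                            # standard deviation

print(sum(deviations), ad, variance, round(sd, 1))
```

The deviations sum to zero, the AD is 4.0, the variance 24.4, and the SD rounds to 4.9, in agreement with the values used in the text.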

Fig. 11. Frequency Distributions with the Same Mean but Different Variability.
[Axes: Scores, Number of Cases; curves: large SD and small SD.]

The SD also provides the basis for expressing an individual's scores on
different tests in terms of norms, as will be shown in the section on standard
scores. The interpretation of the SD is especially clear-cut when applied to a
normal or approximately normal distribution curve. In such a distribution,
there is an exact relationship between the SD and the proportion of cases, as
shown in Figure 12. On the baseline of this normal curve have been marked
distances representing one, two, and three standard deviations above and
below the mean. For instance, in the example given in Table 2, the mean would
correspond to a score of 40, +1σ to 44.9 (40 + 4.9), +2σ to 49.8
(40 + 2 × 4.9), and so on. The percentage of cases that fall between the
mean and +1σ in a normal curve is 34.13. Since the curve is symmetrical,
34.13 per cent of the cases are likewise found between the mean and −1σ,
so that between +1σ and −1σ on both sides of the mean there are 68.26 per
cent of the cases. Nearly all the cases (99.72 per cent) fall within ±3σ from
the mean. These relationships are particularly relevant in the interpretation
of standard scores and percentiles to be discussed in later sections.
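
The normal curve percentages quoted above can be checked with the error function in Python's `math` module. This is only a sketch; the function name `pct_within` is illustrative. The exact values (68.27, 95.45, 99.73) differ from the text's figures by 0.01 because the text sums rounded table entries.

```python
from math import erf, sqrt

def pct_within(k):
    """Percentage of a normal distribution lying within k SDs of the mean."""
    return 100 * erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(pct_within(k), 2))

# Baseline marks at +1 and +2 SD for the Table 2 example (M = 40, SD = 4.9).
mean_score, sd = 40, 4.9
marks = [round(mean_score + k * sd, 1) for k in (1, 2)]
print(marks)
```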


Fig. 12. Percentage Distribution of Cases in a Normal Curve. [Baseline marked
from −3σ to +3σ; 34.13% of cases between the mean and +1σ and between the mean
and −1σ; 68.26% within ±1σ, 95.44% within ±2σ, 99.72% within ±3σ.]


AGE SCORES 


The concept of mental age, it will be recalled, was introduced in the 1908
revision of the Binet-Simon scales. In age scales such as the Binet and its
revisions, individual items are grouped into year levels. For example, those
items passed by the majority of 7-year-olds in the standardization sample
are placed at the 7-year level, those passed by the majority of 8-year-olds are
assigned to the 8-year level, and so forth.¹ A child's score on this test will
then correspond to the highest year level that he can successfully complete.
If, for example, a 10-year-old child can satisfactorily complete the 12-year
items, his mental age (MA) is 12, although his chronological age (CA) is 10.
He is thus two years accelerated, since he equals the performance of an
average child two years his senior.

In actual practice, the individual's performance on age scales such as the
Binet shows a certain amount of scatter. In other words, the subject fails
some tests below his mental age level and passes some above it. For this
reason, it is customary to compute the basal age, i.e., the highest age at and
below which all tests are passed. Partial credits, in months, are then added
to this basal age for all tests passed at higher year levels. The child's
mental age on the test is the sum of the basal age and the additional months
of credit earned at higher age levels.
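
The arithmetic of basal age plus partial credits can be illustrated with a short sketch. The record below is hypothetical, and the figure of 2 months of credit per test is an assumption made only for the example; actual credit values depend on the particular scale.

```python
def mental_age_months(basal_age_years, credits_months):
    """Basal age (in months) plus the months of credit earned above it."""
    return basal_age_years * 12 + sum(credits_months)

# Hypothetical record: all tests passed through year 8; above that level,
# three tests passed at year 9 and one at year 10, each assumed to carry
# 2 months of credit.
credits = [2, 2, 2, 2]
ma = mental_age_months(8, credits)
years, months = divmod(ma, 12)
print(f"MA = {years} years {months} months")
```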
Mental age norms may also be employed with tests that are not divided
into year levels. In such a case, the subject's raw score is first determined.
Such a score may be the total number of correct items on the whole test; or
it may be based on time, on number of errors, or on some combination of
such measures. The mean raw scores obtained by the children in each year
group within the standardization sample constitute the age norms for such
a test. The mean raw score of the 8-year-old children, for example, would
represent the 8-year norm. If an individual's raw score is equal to the mean
8-year-old raw score, then his mental age on the test is 8 years. All raw
scores on such a test would be transformed in a similar manner by reference
to the age norms.
It should be noted that the mental age unit does not remain constant with
age, but tends to shrink with advancing years. For example, a child who is
one year retarded at age 4 will be approximately three years retarded at age
12. One year of mental "growth" from ages 3 to 4 is equivalent to three years
of growth from ages 9 to 12. Since intellectual development progresses more
rapidly at the earlier ages and gradually decreases as the individual
approaches his mature limit, the mental age unit shrinks correspondingly with
age. This relationship may be more readily visualized if we think of the
individual's height as being expressed in terms of "height age." The
difference, in inches, between a "height age" of 3 and 4 years would be
greater than that between a "height age" of 10 and 11. Owing to the
progressive shrinkage of the MA unit, one year of acceleration or retardation
at, let us say, age [...] greater deviation from the norm than a similar
amount of acceleration [...]

¹ The exact percentage who must pass an item varies somewhat at different
year levels. This percentage must decrease from the lower to the upper year
levels, if the IQ is to remain constant. For a fuller explanation of this
point, cf. McNemar (20, p. 9).

To provide a measure that, unlike mental age, permits a uniform
interpretation regardless of the age of the subject, the Intelligence Quotient
(IQ) was introduced. Although the need for such a ratio measure had been
previously indicated by Stern and by Kuhlmann, the IQ was first employed
in the 1916 form of the Stanford-Binet. The IQ is the ratio of mental age to
chronological age, the fraction being customarily multiplied by 100 in order
to avoid the use of decimals, as shown below:


$$\mathrm{IQ} = 100 \times \frac{\mathrm{MA}}{\mathrm{CA}}$$


If a child's mental age equals his chronological age, his IQ will be exactly
100. An IQ of 100 thus represents normal or average performance. IQ's
below 100 indicate retardation, and those above 100, acceleration. The
shrinkage in the mental age unit is automatically adjusted by the use of the
ratio. For example, if a 4-year-old has a mental age of 3, his IQ will be 75
(100 × 3/4 = 75). The same child at age 12 will probably have a mental age of
9, and his IQ will still be 75 (100 × 9/12 = 75). Such an IQ indicates the
same relative standing in the group, whether obtained by a 4-year-old or by a
12-year-old. The IQ is thus comparable at different ages, in the sense that
the interpretation of a particular IQ remains the same regardless of the age
of the subject.

The IQ, however, will remain constant only when the mental age unit
shrinks in direct proportion with age. This condition must be met if the IQ
is to be meaningfully employed with a particular test. When the mental age
unit shrinks, individual differences measured in terms of such a unit will
increase proportionately. This follows arithmetically, in the same way that
individual differences in height will be 12 times as large when measured in
inches as when measured in feet, since the inch is 1/12 as large as the foot.
Accordingly, individual differences in mental age [...] twice as large at age
14 as at [...] constant. It is only under such conditions that a [...]
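
The ratio IQ and the condition for its constancy can be illustrated with a short sketch. The function name `ratio_iq` is illustrative; the first two calls reproduce the worked example in the text, and the last lines show how the lag in mental age must grow in proportion to chronological age if an IQ of 75 is to stay fixed.

```python
def ratio_iq(mental_age, chronological_age):
    """Ratio IQ: 100 times mental age over chronological age (same units)."""
    return 100 * mental_age / chronological_age

print(ratio_iq(3, 4))    # 4-year-old with a mental age of 3
print(ratio_iq(9, 12))   # the same child at 12 with a mental age of 9
print(ratio_iq(12, 10))  # 10-year-old with a mental age of 12

# Years of retardation implied by a constant IQ of 75 at several ages.
lags = {ca: ca - 0.75 * ca for ca in (4, 8, 12)}
print(lags)
```

The lag rises from one year at age 4 to three years at age 12, which is the sense in which the mental age unit must shrink in direct proportion with age.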


This condition was met closely enough in the 1937 Stanford-Binet (20, 24)
to make the IQ applicable to this test. Figure 13 shows the spread of
Stanford-Binet mental ages from ages 6 to 18, as indicated by the range of
approximately the middle 68 per cent of the cases at each age. It will be
recalled that this is the approximate percentage of cases falling between
+1 SD and −1 SD from the mean in a normal curve. The trend toward greater
variability in mental age with increasing chronological age is clearly
apparent in Figure 13. Even in such a carefully constructed test as the
Stanford-Binet, however, some inequalities remained in the SD's of the IQ at
different ages (13, 24), a difficulty that was handled by the preparation of
a correction table to be used with IQ's at certain age levels (20, pp.
173-174). In the 1960 revision of the Stanford-Binet (25), the problem was
circumvented by replacing the ratio IQ with deviation IQ's, to be discussed
later in this chapter.


Fig. 13. Means and Standard Deviations of Stanford-Binet Mental Ages. (Data
from McNemar, 20, pp. 32-33.) [Axes: Chronological Age, Mental Age.]

In several other intelligence tests that provide age norms, however, the
conditions for IQ constancy are not met. In such tests, the same IQ may
signify different degrees of superiority or inferiority at different ages. In
the Merrill-Palmer scale, for example, an IQ of 114 at one age may indicate
the same degree of superiority as an IQ of 141 at another age (23). Despite
its apparent logical simplicity, the IQ is not directly applicable to most
psychological tests. Its use should be preceded by a thorough check of
variability at different ages, to insure that the condition of uniform IQ
variability, or proportionately increasing MA variability, has been met.

A further limitation in the applicability of the IQ is to be found in adult
testing. A consideration of the very concept of age norms will indicate that
their usefulness is largely restricted to children. To say that a 10-year-old
child has a mental age of 12 conveys a vivid and objectively definable picture.
It is one of the chief advantages of the mental age concept that it can be
clearly grasped by the layman. Such an advantage is greatly reduced, however,
with adult mental ages. On such a test as the Stanford-Binet, for example, the
average adult does not improve much beyond the 15-year level. Hence the
average adult mental age on this test is under 16 (actually 15 years-9
months). To be sure, superior adult levels have been added to give the test
adequate ceiling, and mental ages above 15-9 may thus be obtained. But to
say that a particular adult received a mental age of 20 on such a test does not
permit the same clear-cut interpretation as would be possible with a mental
age of 8 or 10. Certainly a mental age of 20 cannot be defined as what the
average 20-year-old can do, since the average 20-year-old obtains a mental
age of 15-9. In testing a feebleminded adult whose mental age is below 15-9,
of course, the mental age concept is as applicable as with children. With
normal and superior adults, however, other types of scores, such as
percentiles or standard scores, are now commonly employed.

Still another limitation of age scores arises from the fact that they can be
employed only with functions that show a clear and consistent change with
age. Traits that exhibit little relation to age obviously do not lend
themselves to measurement in terms of age units. Most personality
characteristics, for example, would fall into this category.


PERCENTILES 


Percentile scores are expressed in terms of the percentage of persons in the
standardization sample who fall below a given raw score. [...] except that in
ranking it is customary [...] the best person in the group receiving a rank
of 1. [...] other hand, we begin counting [...]

The 50th percentile (P₅₀) corresponds to the median, already discussed
as a measure of central tendency. Percentiles above 50 represent above-average
performance, those below 50 signify inferior performance. The 25th
and 75th percentile are known as the first and third quartile points (Q₁ and
Q₃), since they cut off the lowest and highest quarters of the distribution.
Like the median, they provide convenient landmarks for describing a
distribution of scores and comparing it with other distributions.

Percentiles should not be confused with the familiar "percentage scores."
The latter are raw scores, expressed in terms of the percentage of correct
items; percentiles are converted scores, expressed in terms of percentage of
persons. A raw score lower than any obtained in the standardization sample
would have a percentile rank of zero (P₀); one higher than any score in the
standardization sample would have a percentile rank of 100 (P₁₀₀). These
percentiles, however, do not imply a zero raw score and a perfect raw score.

In test manuals, percentile norms are sometimes reported in the form of
a graph, as illustrated in Figure 14. Such a graph, known as an ogive, shows
the cumulative percentage of cases falling below each score. As in the
previously discussed frequency graphs, scores are given on the baseline,
frequencies (i.e., cumulative percentages) on the vertical axis. Figure 14 was
plotted from the data of Table 3, which shows the same 1000 scores given
in Table 1. The first two columns of Table 3, giving class intervals and
frequencies, are identical with those of Table 1. In the third column are the
cumulative frequencies, found by adding frequencies from the bottom up.
Thus the number of cases falling at or below a score of 11 is 2; the number
at or below 15 is 5 (3 + 2); the number at or below 19 is 13 (8 + 5). In
the fourth column, these cumulative frequencies have been changed to
percentages by dividing each by 10 (since N = 1000).

Fig. 14. Cumulative Frequency Graph Used in Finding Percentiles. (Data from
Table 3.) [Marked on the graph: Q₃ = P₇₅ = 36.]


TABLE 3. Cumulative Frequency Distribution

                                                  Cumulative
   Class                        Cumulative        Percentage
   Interval      Frequency      Frequency         Frequency
   52-55              1            1000             100.0
   48-51              1             999              99.9
   44-47             20             998              99.8
   40-43             73             978              97.8
   36-39            156             905              90.5
   32-35            328             749              74.9
   28-31            244             421              42.1
   24-27            136             177              17.7
   20-23             28              41               4.1
   16-19              8              13               1.3
   12-15              3               5               0.5
    8-11              2               2               0.2
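
The two cumulative columns of Table 3 can be reproduced from the frequencies alone. The following is a minimal sketch using `itertools.accumulate`; the variable names are illustrative.

```python
from itertools import accumulate

# Class intervals (lowest first) and frequencies from Table 3.
intervals = [(8, 11), (12, 15), (16, 19), (20, 23), (24, 27), (28, 31),
             (32, 35), (36, 39), (40, 43), (44, 47), (48, 51), (52, 55)]
freqs = [2, 3, 8, 28, 136, 244, 328, 156, 73, 20, 1, 1]

n = sum(freqs)
cum_freqs = list(accumulate(freqs))          # running total from the bottom up
cum_pcts = [100 * c / n for c in cum_freqs]  # cumulative percentage frequency

for (low, high), f, c, p in zip(intervals, freqs, cum_freqs, cum_pcts):
    print(f"{low:>2}-{high:<2} {f:>4} {c:>5} {p:>6.1f}")
```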


The ogive in Figure 14 shows the cumulative percentage frequency below
the upper limit of each class interval.² Such an ogive can be read in either
direction. For example, if we wish to find the median, we locate the 50 per
cent point on the vertical axis, draw a horizontal line from this point to the
graph, and at the point where the line meets the graph drop a perpendicular
to the baseline. This has been done in Figure 14, showing that the median
is 32.5. Similarly, Q₁ is found to be approximately 29 and Q₃ approximately
36. The raw score corresponding to any other percentile can be found in
the same manner; for the 80th percentile, for example, it is 37. Working in
the opposite direction, we can start with an individual's raw score and locate
the percentile rank corresponding to it. Thus for a raw score of 27, we raise a
perpendicular above 27 on the baseline until it meets the curve; a horizontal
line drawn from that point to the vertical axis shows the percentile rank to
be 15.
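
Reading the ogive in either direction amounts to linear interpolation between plotted points. The sketch below assumes that each cumulative percentage is plotted at the upper real limit of its class interval (11.5, 15.5, and so on) and that the curve starts at 0 per cent at 7.5; both function names are illustrative. Values read from a drawn graph, as in the text, will agree only approximately.

```python
from itertools import accumulate

upper_limits = [11.5, 15.5, 19.5, 23.5, 27.5, 31.5,
                35.5, 39.5, 43.5, 47.5, 51.5, 55.5]
freqs = [2, 3, 8, 28, 136, 244, 328, 156, 73, 20, 1, 1]
cum_pcts = [100 * c / sum(freqs) for c in accumulate(freqs)]

# Points of the ogive, starting from 0 per cent at the lowest real limit.
xs = [7.5] + upper_limits
ys = [0.0] + cum_pcts

def score_at_percentile(p):
    """Read from the vertical axis across to the curve and down to the baseline."""
    for i in range(1, len(xs)):
        if ys[i] >= p:
            frac = (p - ys[i - 1]) / (ys[i] - ys[i - 1])
            return xs[i - 1] + frac * (xs[i] - xs[i - 1])
    return xs[-1]

def percentile_rank(score):
    """Read from the baseline up to the curve and across to the vertical axis."""
    if score <= xs[0]:
        return 0.0
    for i in range(1, len(xs)):
        if xs[i] >= score:
            frac = (score - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + frac * (ys[i] - ys[i - 1])
    return 100.0

for p in (25, 50, 75, 80):
    print(p, round(score_at_percentile(p), 1))
print(round(percentile_rank(27), 1))
```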


² The observant reader may have noticed that the points on the graph have
been plotted slightly to the right of the score values on the baseline. [...]
a continuous scale, the numbers 11, [...] scores, at which cumula[...]

Not only do percentiles show where the individual stands in the normative
sample, but they are also useful in comparing the individual's own performance
on different tests. For example, if a child obtains a raw score of 30 on an
arithmetic test and 58 on a reading test, we cannot compare these two scores
directly because they are expressed in different units. Suppose, however, that
reference to percentile norms shows that a score of 30 on the arithmetic test
corresponds to a percentile rank of 65, and a score of 58 on the reading test
to a percentile rank of 40. Now we can conclude that the child did much
better in arithmetic than in reading.

Percentile scores have several advantages. They are easy to compute and
can be readily understood, even by relatively untrained persons. Moreover,
percentiles are universally applicable. They can be used equally well with
adults and children, and are suitable for any type of test, whether it measures
aptitude or personality variables.

Fig. 15. Percentile Ranks in a Normal Distribution. [Marked on the baseline:
Q₁, Mdn, Q₃.]

The chief drawback of percentile scores arises from the marked inequality
of their units, especially at the extremes of the distribution. If the
distribution of raw scores approximates the normal curve, as is true of most
test scores, then raw score differences near the median or center of the
distribution are exaggerated in the percentile transformation, while raw score
differences near the ends of the distribution are greatly shrunk. This
distortion of distances between scores can be seen in Figure 15. In a normal
curve, it will be recalled, cases cluster closely at the center and scatter
more widely as the extremes are approached. Consequently, any given percentage
of cases near the center covers a shorter distance on the baseline than the
same percentage near the ends of the distribution. In Figure 15, this
discrepancy in the gaps between percentile ranks (PR) can readily be seen if
we compare the distance between a PR of 40 and a PR of 50 with that between a
PR of 10 and a PR of 20. Even more striking is the discrepancy between these
distances and that between a PR of 10 and a PR of 1. (In a mathematically
derived normal curve, zero percentile is not reached until infinity and hence
cannot be shown on the graph.)

The same relationship can be seen from the opposite direction if we examine
the percentile ranks corresponding to equal σ-distances from the mean of a
normal curve. These percentile ranks are given under the graph in Figure 15.
Thus the percentile difference between the mean and +1σ is 34 (84 − 50). That
between +1σ and +2σ is only 14 (98 − 84).
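
The unequal spacing of percentile units can be checked numerically with `statistics.NormalDist` from the Python standard library (version 3.8 or later). This is a sketch only; `baseline_gap` is an illustrative helper, and distances are expressed in SD units.

```python
from statistics import NormalDist

unit_normal = NormalDist()  # mean 0, SD 1

# Percentile ranks at equal sigma-distances from the mean.
pr = {z: 100 * unit_normal.cdf(z) for z in (-2, -1, 0, 1, 2)}
print({z: round(p) for z, p in pr.items()})

def baseline_gap(pr_low, pr_high):
    """Distance in SD units between two percentile ranks on the baseline."""
    return unit_normal.inv_cdf(pr_high / 100) - unit_normal.inv_cdf(pr_low / 100)

gap_40_50 = baseline_gap(40, 50)
gap_10_20 = baseline_gap(10, 20)
gap_1_10 = baseline_gap(1, 10)
print(round(gap_40_50, 2), round(gap_10_20, 2), round(gap_1_10, 2))
```

The gap between PR 1 and PR 10 comes out roughly four times as wide as that between PR 40 and PR 50, even though the latter spans more percentile points.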

It is apparent that percentiles give a correct picture of each individual's
relative position in the normative sample, but not of the amount of difference
between his score and that of another person. For this reason, percentiles are
unsuitable for the computation of means, standard deviations, and several
other statistical measures. The results of such computations with percentiles
would differ from those obtained with raw scores. For example, the mean of
two percentiles does not equal the percentile corresponding to the mean of
the two raw scores. Percentiles thus provide a crude though simple and widely
applicable method of indicating the individual's standing in reference to the
test norms.
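
The point about averaging percentiles can be made concrete with a small sketch. It assumes a normal distribution of raw scores and two hypothetical individuals located at the mean and at 2 SD above it; `percentile_rank` is an illustrative helper.

```python
from statistics import NormalDist

unit_normal = NormalDist()

def percentile_rank(z):
    """Percentile rank of a score lying z SDs from the mean of a normal curve."""
    return 100 * unit_normal.cdf(z)

# Two hypothetical raw scores, expressed in SD units from the mean.
z_a, z_b = 0.0, 2.0

mean_of_percentiles = (percentile_rank(z_a) + percentile_rank(z_b)) / 2
percentile_of_mean = percentile_rank((z_a + z_b) / 2)
print(round(mean_of_percentiles, 1), round(percentile_of_mean, 1))
```

Averaging the two percentile ranks gives about 74, whereas the score midway between the two raw scores stands at about the 84th percentile.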


STANDARD SCORES 


Current tests are making increasing use of standard scores, which are the
most satisfactory type of transformed score from most points of view. Standard
scores express the individual's distance from the mean in terms of the
standard deviation of the distribution.

Linear Standard Scores. [...] constant. The relative magnitude [...] derived
by such a linear transformation corresponds exactly to that between the raw
scores. All properties of the original distribution of raw scores are
duplicated in the distribution of these standard scores. For this reason, any
computations that can be carried out with the original raw scores can also be
carried out with linear standard scores, without any distortion of results.

Linearly derived standard scores are often designated simply as "standard
scores," or "z scores." To compute a z score, we find the difference between
the individual's raw score and the mean of the normative group, and then
divide this difference by the SD of the normative group. Table 4 shows the
computation of z scores for two individuals, one of whom falls 1 SD above the
group mean, the other .40 SD below the mean. Any raw score that is exactly
equal to the mean is equivalent to a z score of zero. It is apparent that such
a procedure will yield derived scores that have a negative sign for all
subjects falling below the mean. Moreover, since the total range of most
groups extends no farther than about 3 SD's above and below the mean, such
standard scores will have to be reported to at least one decimal place in
order to provide sufficient differentiation among individuals.


TABLE 4. Computation of Standard Scores

   $z = \frac{X - M}{SD}$          M = 60          SD = 5

   JOHN'S SCORE                         BILL'S SCORE
   X₁ = 65                              X₂ = 58
   $z_1 = \frac{65 - 60}{5}$            $z_2 = \frac{58 - 60}{5}$
       = +1.00                              = −0.40


Both of the above conditions, viz., the occurrence of negative values and of
decimals, tend to produce awkward numbers that are confusing and difficult
to use for both computational and reporting purposes. For this reason, some
further linear transformation is usually applied, simply to put the scores into
a more convenient form. For example, the scores on the AGCT (Army General
Classification Test) employed during World War II were standard scores
adjusted to a mean of 100 and an SD of 20. Thus a standard score of −1 on
this test would be expressed as 80 (100 − 20 = 80). Similarly, a standard
score of +1.5 would correspond to 130 (100 + 1.5 × 20 = 130). To convert
an original standard score to the new scale, it is simply necessary to
multiply the standard score by the desired SD (20) and add it to or subtract
it from the desired mean (100).

Any other convenient values can be arbitrarily chosen for the new mean
and SD. The College Entrance Examination Board employs a mean of 500
and an SD of 100. Scores on the separate subtests of the Wechsler Intelligence
Scales (see Ch. 12) are converted to a distribution with a mean of 10 and an
SD of 3. All such measures are examples of linearly transformed standard
scores.
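
The computation of z scores and their conversion to a more convenient scale can be sketched as two small functions. The names `z_score` and `rescale` are illustrative. The Table 4 example is taken with M = 60 and SD = 5, the SD implied by the two z scores reported in the text (+1.00 and −.40).

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the group mean, in SD units."""
    return (raw - mean) / sd

def rescale(z, new_mean, new_sd):
    """Linear transformation of a z score to a scale with chosen mean and SD."""
    return new_mean + z * new_sd

# Table 4 example.
z_john = z_score(65, 60, 5)
z_bill = z_score(58, 60, 5)

agct = [rescale(z, 100, 20) for z in (-1.0, 1.5)]  # mean 100, SD 20
ceeb = rescale(z_john, 500, 100)                   # mean 500, SD 100
wechsler_subtest = rescale(z_bill, 10, 3)          # mean 10, SD 3

print(z_john, z_bill, agct, ceeb, round(wechsler_subtest, 1))
```

Because both steps are linear, the relative distances between individuals are the same on every one of these scales; only the numbers used to express them change.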


Normalized Standard Scores. It will be recalled that one of the reasons for
transforming raw scores into any derived scale is to render scores on different
tests comparable. The linearly derived standard scores discussed in the
preceding section will be comparable only when found from distributions that
have approximately the same form. Under such conditions, a score
corresponding to 1 SD above the mean, for example, signifies that the
individual occupies the same position in relation to both groups. His score
exceeds approximately the same percentage of persons in both distributions,
and this percentage can be determined if the form of the distribution is
known. If, however, one distribution is markedly skewed and the other normal,
a z score of +1.00 might exceed only 50 per cent of the cases in one group,
but would exceed 84 per cent in the other.

In order to achieve comparability of scores from dissimilarly shaped
distributions, non-linear transformations may be employed to fit the scores to
any specified type of distribution curve. The mental age and percentile scores
described in earlier sections represent non-linear transformations, but they
are subject to other limitations already discussed. Although under certain
circumstances it may be preferable to fit scores to some other type of
distribution (cf. 9, pp. 725 ff.; 10), the normal curve is usually employed
for this purpose. One of the chief reasons for such a practice is that raw
score distributions more often approximate the normal curve than any other
type of curve. Moreover, physical measures such as height and weight, which
use equal-unit scales derived through physical operations, generally yield
normal distributions.³ Another important advantage of the normal curve is that
it has many useful mathematical properties, which facilitate further
computations.

Normalized standard scores are standard scores expressed in terms of a
distribution that has been transformed to fit a normal curve. Such scores
can be computed by reference to tables giving the percentage of cases falling
at different σ-distances from the mean of a normal curve. First, the percentage
of persons in the standardization sample falling at or above each raw score is
found. Such a percentage is then located in the normal curve frequency table,
and the corresponding normalized standard score is obtained. Normalized
standard scores are expressed in the same form as linearly derived standard
scores, viz., with a mean of zero and an SD of 1. Thus a normalized score of
zero indicates that the individual falls at the mean of a normal curve,
excelling 50 per cent of the group. A score of −1.00 means that he surpasses
approximately 16 per cent of the group; and a score of +1.00, that he
surpasses 84 per cent. These percentages correspond to a distance of 1 SD
below and 1 SD above the mean of a normal curve, respectively, as can be seen
by reference to Figure 15.

³ Partly for this reason and partly as a result of other theoretical
considerations, it has frequently been argued that, by normalizing raw scores,
an equal-unit scale could be developed for psychological measurement, similar
to the equal-unit scales of physical measurement. This, however, is a
debatable point which involves certain questionable assumptions. For a good
introduction to the logic of scales of measurement, with special reference to
problems of psychological measurement, cf. Bergman and Spence (5), Comrey (7),
Lorge (18), and Stevens (22).
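
A rough sketch of the normalizing procedure follows, with `statistics.NormalDist.inv_cdf` standing in for the printed normal curve table. Two points are assumptions of this sketch and not taken from the text: the raw scores are hypothetical, and each score's percentage is counted as everyone below it plus half of those obtaining it, so that no score reaches 0 or 100 per cent.

```python
from statistics import NormalDist

def normalized_scores(raw_scores):
    """Map each distinct raw score to a normalized standard score (mean 0, SD 1)."""
    n = len(raw_scores)
    unit_normal = NormalDist()
    table = {}
    for score in sorted(set(raw_scores)):
        below = sum(1 for s in raw_scores if s < score)
        at = sum(1 for s in raw_scores if s == score)
        proportion = (below + 0.5 * at) / n          # assumed counting convention
        table[score] = unit_normal.inv_cdf(proportion)  # look up the sigma-distance
    return table

# Hypothetical, positively skewed set of raw scores.
raw = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
table = normalized_scores(raw)
for score, z in table.items():
    print(score, round(z, 2))
```

Note that the outlying raw score of 9 receives a normalized score no more extreme than that of the lowest score, since the transformation depends only on the percentage of cases surpassed.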

Like linearly derived standard scores, normalized standard scores can be
put into any convenient form. If the normalized standard score is multiplied
by 10 and added to or subtracted from 50, it is converted into a T score, a
type of score first proposed by McCall (19). On such a scale, a score of 50
corresponds to the mean, a score of 60 to 1 SD above the mean, and so forth.
Another well-known transformation is represented by the stanine scale
employed by the United States Air Force during World War II (cf. 9, pp.
727 ff.). This scale provides a single-digit system of scores, with a mean of 5
and an SD of approximately 2.⁴ The name "stanine" (a contraction of "standard
nine") is based on the fact that the scores run from 1 to 9. The restriction
of scores to single-digit numbers has certain computational advantages,
especially in machine computation.

Raw scores can readily be converted to stanines by arranging the original
scores in order of size and then assigning stanines in accordance with the
normal curve percentages reproduced in Table 5. For example, if our group
consists of exactly 100 persons, the four lowest-scoring persons receive a
stanine score of 1, the next 7 a score of 2, the next 12 a score of 3, and so
on. When the group contains more or fewer than 100 cases, the number
corresponding to each designated percentage is first computed; these numbers
of cases are then given the appropriate stanines. Thus out of 200 cases, 8
would be assigned a stanine of 1 (4 per cent of 200 = 8). With 150 cases, 6
would receive a stanine of 1 (4 per cent of 150 = 6).


TABLE 5. Normal Curve Percentages for Use in Stanine Conversion

   Percentage     4    7   12   17   20   17   12    7    4
   Stanine        1    2    3    4    5    6    7    8    9


Although normalized standard scores are the most satisfactory type of score
for the majority of purposes, there are nevertheless certain technical
objections to normalizing all distributions routinely. Such a transformation
should be carried out only when the sample is large and representative and
when there is reason to believe that the deviation from normality results from
defects in the test rather than from characteristics of the sample or from
other factors affecting the behavior under consideration. It should also be
noted that when the distribution of raw scores approximates normality, the
linearly derived standard scores and the normalized standard scores will be
very similar. Although the methods of deriving these two types of scores are
quite dissimilar, the resulting scores will be nearly identical under such
conditions. Obviously, the process of normalizing a distribution that is
already virtually normal will produce little or no change. Whenever feasible,
it is generally more desirable to obtain a normal distribution of raw scores
by proper adjustment of the difficulty level of test items, rather than by
subsequently normalizing a markedly non-normal distribution. With an
approximately normal distribution of raw scores, the linearly derived standard
scores will serve the same purposes as normalized standard scores.

⁴ Kaiser [...] a slight modification of the stanine scale which yields an SD
of exactly 2, thus being easier to handle quantitatively.

Normal Percentile Graphs. Increasing use is being made of a graphical 
technique for reporting test scores that combines some of the advantages of 
percentiles and normalized standard scores. Reference to Figure 15 will sug- 
gest how such a combination can be effected. If successive percentile units are 
spaced closer together at the center and farther apart as the extremes of 
the distribution are approached, they can be made to correspond to equal 
units on the baseline of a normal curve. The plotting of such percentile points 
is facilitated by the use of arithmetic probability paper, a cross-section paper 
in which the vertical lines are spaced in the same way as the percentile points 
in a normal distribution, while the horizontal lines are uniformly spaced. 

When percentile scores are indicated on such a graph, the relationships 
among them can be properly visualized in terms of normally distributed scores. 
For example, if individuals A, B, C, and D receive percentile scores of 10, 20,
40, and 50, respectively, it will be apparent on such a graph that the
difference in performance between individuals A and B is much larger than that
between individuals C and D. If standard scores are also indicated along the
axis of such a graph, percentiles can be converted to standard scores at a
glance.
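The unequal spacing of percentiles along the baseline of a normal curve can be checked numerically. The sketch below is only an illustration: it assumes a perfectly normal distribution and uses hypothetical variable names. It converts the four percentile scores of the example to standard scores and compares the two gaps.

```python
from statistics import NormalDist

def z_for_percentile(percentile):
    """Standard score below which the given percentage of a normal curve falls."""
    return NormalDist().inv_cdf(percentile / 100)

percentiles = {"A": 10, "B": 20, "C": 40, "D": 50}
z = {name: z_for_percentile(p) for name, p in percentiles.items()}

gap_ab = z["B"] - z["A"]  # distance between the 10th and 20th percentiles
gap_cd = z["D"] - z["C"]  # distance between the 40th and 50th percentiles
print(f"A-B gap: {gap_ab:.2f} SD, C-D gap: {gap_cd:.2f} SD")
# A-B gap: 0.44 SD, C-D gap: 0.25 SD
```

Although both pairs differ by 10 percentile points, the first pair is separated by a considerably larger distance in standard-score units than the second.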

A relatively early use of such a combination of percentiles and standard
scores is to be found in the Normal Percentile Chart prepared by Otis (21).
This chart is designed for recording the performance of a whole group on one
or two tests. It can be used to facilitate the computation of norms in terms of
both percentiles and normalized standard scores, as well as for a number of
other purposes related to the interpretation of test results.

Normal percentile graphs have also proved helpful in plotting individual
profiles. Such profiles show the subject's relative standing on different tests, all
scores being expressed in comparable units and with reference to a common
norm. The profile method of reporting scores has become especially prominent
in connection with the growing use of batteries for the differential testing
of aptitudes. Among the current batteries that employ normal percentile
graphs in plotting profiles are the tests of Primary Mental Abilities published
by Science Research Associates and the Differential Aptitude Tests prepared
by The Psychological Corporation. Since the percentile points are spaced in
accordance with normal curve distances, the individual's relative standing on
different tests is not distorted. If, as in the Differential Aptitude Tests, standard
score equivalents are also provided on the graph, numerical scores that are
likewise free from distortion can be read directly from the graph. Part of a
sample report form from the Differential Aptitude Tests is reproduced in
Figure 16.
Fig. 16. Individual Report Form for Use with the Differential Aptitude Tests,
Illustrating the Application of Normal Percentile Graphs in Plotting Profiles.
(From Bennett, Seashore, and Wesman, 4, p. 2; reproduced by permission of
The Psychological Corporation.)


The Deviation IQ. Another adaptation of standard scores is to be found
in the recently developed concept of a deviation IQ. These so-called IQ's are
actually standard scores with a mean of 100 and an SD that approximates
the SD of the familiar Stanford-Binet IQ distribution. Although the SD of the
1937 Stanford-Binet IQ was not exactly constant at all ages, it fluctuated
around a median value slightly greater than 16 (24, p. 40). Hence if an SD
close to 16 is chosen in reporting standard scores on a newly developed test,
the resulting scores can be interpreted in the same way as Stanford-Binet IQ's.
Since Stanford-Binet IQ's have been in use for many years, testers and
clinicians have become accustomed to interpreting and classifying test
performance in terms of such IQ levels. They have learned what to expect from
individuals with IQ's of 40, 70, 90, 130, and so forth. There are therefore
certain practical advantages in the use of a derived scale that corresponds to
the familiar distribution of Stanford-Binet IQ's. Such a correspondence of score
units can be achieved by the selection of numerical values for the mean and SD
that agree closely with those in the Stanford-Binet distribution.

It should be added that the use of the term "IQ" to designate such standard
scores may at first be somewhat misleading. Such IQ's are not derived by the
same methods employed in finding traditional ratio IQ's. They are not ratios
of mental ages and chronological ages. For this reason, some criticism of the
practice of calling such scores "IQ's" has been expressed. The justification
lies in the general familiarity of the term "IQ," and in the fact that such scores
can be interpreted as IQ's provided that their SD is approximately equal to
that of previously known IQ's. The more precise expression, "deviation IQ,"
should eventually succeed in eliminating any confusion regarding the
derivation of such measures. Among the first tests to express scores in terms of
deviation IQ's were the Wechsler Intelligence Scales. In these tests, the mean is
100 and the SD 15. Deviation IQ's are also used in a number of current group
tests of intelligence and in the 1960 revision of the Stanford-Binet.

At this stage in our discussion of converted scores, the reader may have
become aware of a rapprochement among the various types of scores.
Percentiles have gradually been taking on at least a graphic resemblance to
normalized standard scores. Linear standard scores are indistinguishable from
normalized standard scores if the original distribution of raw scores closely
approximates the normal curve. Finally, standard scores are becoming IQ's, and
vice versa. In connection with the last point, a re-examination of the meaning
of a ratio IQ on such a test as the Stanford-Binet will show that these IQ's
can themselves be interpreted as standard scores. If we know that the 1937
Stanford-Binet IQ distribution had a mean of 100 and an SD of approximately
16, we can conclude that an IQ of 116 falls at a distance of 1 SD above the
mean and represents a standard score of +1.00. Similarly, an IQ of 132
corresponds to a standard score of +2.00, an IQ of 76 to a standard score of
-1.50, and so forth. Moreover, a Stanford-Binet ratio IQ of 116 corresponds
to a percentile rank of approximately 84, since it will be recalled that in a
normal curve 84 per cent of the cases fall below +1.00 SD (Fig. 15).

In Figure 17 are summarized the relationships that exist in a normal
distribution among the types of scores discussed in this chapter. These include
z scores, AGCT scores, College Entrance Examination Board (CEEB) scores,
Wechsler deviation IQ's (SD = 15), T scores, stanines, and percentiles. Ratio
IQ's on any test will coincide with the given deviation IQ scale if they are
normally distributed and have an SD of 15. Any other normally distributed
IQ could be added to the chart, provided we knew its SD. If the SD is 20, for
instance, then an IQ of 120 corresponds to +1σ, an IQ of 80 to -1σ, etc.
Fig. 17. Relationships among Different Types of Test Scores in a Normal Distribution.
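The relationships summarized in Figure 17 can be approximated with a short conversion routine. This is a sketch under stated assumptions: the figure itself is not reproduced here, so the means and SDs for the AGCT and CEEB scales are the values commonly cited for those scales rather than values read from the chart, and the stanine is obtained from the rounded expression 5 + 2z, which is an approximation to the percentage method of Table 5. All names are hypothetical.

```python
from statistics import NormalDist

# (mean, SD) of each linear scale; AGCT and CEEB values are assumptions
# based on how those scales are commonly reported.
SCALES = {
    "z": (0, 1),
    "T": (50, 10),
    "CEEB": (500, 100),
    "AGCT": (100, 20),
    "Wechsler deviation IQ": (100, 15),
}

def convert(z):
    """Express a standard score z on each scale, assuming a normal distribution."""
    row = {name: mean + sd * z for name, (mean, sd) in SCALES.items()}
    row["percentile"] = 100 * NormalDist().cdf(z)
    # Stanine: mean 5, SD about 2, restricted to the whole numbers 1 to 9.
    row["stanine"] = min(9, max(1, round(5 + 2 * z)))
    return row

for name, value in convert(1.0).items():
    print(f"{name}: {value:.0f}")
```

For a standard score of +1.00 the routine yields a T score of 60, a deviation IQ of 115, and a percentile rank of about 84, in agreement with the values discussed in the text.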


In conclusion, the exact form in which scores are reported is dictated
largely by convenience, familiarity, and ease of developing norms. Standard
scores in any form (including the deviation IQ) appear to be gradually
replacing other types of scores, because of certain advantages they offer with
regard to test construction and statistical treatment of data. All types of
converted scores, however, are fundamentally similar if carefully derived and
properly interpreted. When certain statistical conditions are met, each of
these scores can be readily translated into any of the others.
SPECIFICITY OF NORMS


Any norm, however expressed, is restricted to the particular normative
population from which it was derived. The test user should never lose sight of
the way in which norms are established. Psychological test norms are in no
sense absolute, universal, or permanent. They merely represent the test
performance of the subjects constituting the standardization sample. In choosing
such a sample, an effort is usually made to obtain a representative cross
section of the population for which the test is designed.

In statistical terminology, a distinction is made between sample and
population. The former refers to the group of individuals actually tested. The
latter designates the larger, but similarly constituted, group from which the
sample is drawn. For example, if we wish to establish a norm of test performance
for the population of native-born, white, 10-year-old, urban, public school
boys, we might test a carefully chosen sample of 500 native-born, white,
10-year-old boys attending public schools in several American cities. The sample
would be checked with reference to geographical distribution, socioeconomic
level, and other relevant characteristics to insure that it was truly
representative of the desired population.

In the development and application of test norms, considerable attention
should be given to the standardization sample. . . . Another sample of the
same population should not yield norms that diverge appreciably from those
obtained. Norms with a large "sampling error" would obviously be of little
value in the interpretation of test scores.


Equally important is the requirement that the sample be representative of
the population under consideration. Subtle selective factors that might make
the sample unrepresentative should be carefully investigated. A number of
such selective factors are illustrated in institutional samples. Since such
samples are usually large and readily available for testing purposes, they offer
an alluring field for the accumulation of normative data. The special limitations
of such samples, however, should be carefully analyzed. Testing subjects in
school, for example, will yield an increasingly superior selection of cases in
the successive grades, owing to the progressive dropping out of the less
able pupils. Nor does such elimination affect different subgroups equally. For
example, the rate of selective elimination from school is greater for boys
than for girls (cf. 2, p. 456), and it is greater in lower than in higher
socioeconomic classes (15).
Selective factors likewise operate in other institutional samples, such as
prisoners, patients in mental hospitals, or institutionalized mental defectives.
Because of many special factors that determine institutionalization itself, such
groups are not representative of the entire population of criminals, psychotics,
or mental defectives. Mental defectives with physical handicaps, for example,
are more likely to be institutionalized than are the physically fit. Similarly,
the relative proportion of lower-grade defectives will be much greater in
institutional samples than in the total population.

Closely related to the question of representativeness of sample is the need
for defining the specific population to which the norms apply. Obviously, one
way of insuring that a sample is representative is to restrict the population to
fit the specifications of the available sample. For example, if the population
is defined to include only 14-year-old school children, rather than all
14-year-old children, then a school sample would be representative. Ideally, of
course, the desired population should be defined in advance in terms of the
objectives of the test. Then a suitable sample should be assembled. Practical
obstacles in obtaining subjects, however, often make such a goal unattainable.
In such a case, it is far better to redefine the population more narrowly than
to report norms on an ideal population which is not adequately represented
by the standardization sample. In actual practice, very few tests are
standardized on such broad populations as is popularly assumed. No test provides
norms for the human species! And it is doubtful whether any tests give truly
adequate norms for such broadly defined populations as "adult American
men," "American 10-year-old children," and the like.

Because of the differences in the nature of the samples upon which
different tests have been standardized, the same individual may appear to be
superior when measured by one test and average or inferior when measured by
another. To take an extreme example, if one set of norms were based on
college freshmen and another on unselected persons of the same age, any one
individual would obviously rate higher in terms of the latter norms than in
terms of the former. Another possible reason for the apparent discrepancy in
an individual's performance on different tests designed for the same purpose
is to be found in a lack of comparability of units. As explained in the
preceding section, if the IQ's on one test have an SD of, let us say, 10, and the
IQ's on another have an SD of 15, then an individual who received an IQ
of 110 on the first test will probably obtain an IQ of 115 on the second.
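The arithmetic behind this example is a simple change of units, which the following sketch makes explicit. The function name and parameters are hypothetical. The conversion also assumes that the two tests were standardized on comparable samples and rank individuals in the same order, which is why the text says only that the individual will "probably" obtain the converted score.

```python
def rescale(score, from_mean, from_sd, to_mean, to_sd):
    """Re-express a score on a scale with a different mean and SD."""
    z = (score - from_mean) / from_sd  # distance from the mean in SD units
    return to_mean + to_sd * z

# An IQ of 110 on a test with SD 10 lies 1 SD above the mean,
# which corresponds to 115 on a test with SD 15.
print(rescale(110, 100, 10, 100, 15))  # 115.0

# The same function yields plain standard scores when to_mean=0 and to_sd=1.
print(rescale(76, 100, 16, 0, 1))      # -1.5
```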

It should also be noted that performance may vary appreciably from one
test to another because the tests differ in content, despite the fact that such
tests may be given the same label. So-called intelligence tests provide many
illustrations of this situation. Although commonly described by the same
blanket term, intelligence tests often vary considerably in the functions they
measure. To be sure, lack of comparability of either content or scale units can
usually be detected by reference to the test itself or to the test manual.
Differences in the respective normative samples, however, are more likely to be
overlooked. Such differences probably account for many otherwise
unexplained discrepancies in test results.

There is considerable evidence to indicate that scores on many tests that
are commonly used interchangeably do in fact differ materially. In the
Harvard Growth Study (8), the performance of elementary school children
on a number of common group intelligence tests revealed consistent
differences among the tests. For example, the median IQ of the entire group of
320 children was 94 on one test, 102 on another, and 110 on a third.
Similarly, an IQ of 108 on the first-mentioned test corresponded to an IQ of
118 on the second and to an IQ of 124 on the third, all three scores falling at
the 80th percentile of the group. In the light of the marked variations found,
the authors conclude that "an IQ of 100, which is commonly interpreted as
indicating average ability and a position near the center of an unselected
group, represents, on tests given for the first time, positions varying from the
19th to the 65th percentile . . . from one in the lower quarter of the group,
representing an ability which is supposed to approximate dullness, to one
near the upper third of the distribution, indicating brightness of a promising
nature" (8, p. 134).

Another, more recent analysis illustrating the same point was based upon
data gathered by the World Book Company on high school students (17).
Three groups of students were selected so as to be closely matched in age,
grade, age-within-grade, and performance on a comprehensive academic
achievement battery. The three groups thus assembled totaled nearly 1200
cases. Each of the matched groups had taken one of three widely used group
intelligence tests, viz., Terman-McNemar Test of Mental Ability, Otis Quick
Scoring Mental Ability Tests, and Pintner General Ability Tests (Verbal
Series). From these results, a table of equivalent IQ's on the three tests was
prepared by finding the scores th . . . the three distributions. This . . .
example, corresponds to an IQ of 76 on the Otis, while an IQ of 141 on the
Terman-McNemar corresponds to one of 134 on the Otis.

To be truly comparable, scores on different tests must be expressed not
only in the same units, but also with reference to comparable standardization
samples. When these conditions are not met, adjustments or corrections
should be applied, as illustrated by the table of equivalent IQ's cited above.
In any event, an IQ or any other score should always be accompanied by
the name of the test on which it was obtained. Test scores cannot be properly
interpreted in the abstract; they must be referred to particular tests. If the
school records show that Bill Jones received an IQ of 94 and Tom Brown an
IQ of 110, such IQ's cannot be accepted at face value without further
information. The positions of these two students might have been reversed by
exchanging the particular tests that each was given in his respective school.

Similarly, any one individual's relative standing in different functions
may be grossly misrepresented through lack of comparability of test norms.
Let us suppose that an individual has been given a verbal comprehension
and a spatial aptitude test to determine his relative standing in the two
fields. If the verbal ability test was standardized on a random sample of high
school students, while the spatial test was standardized on a selected group
of boys attending elective shop courses, the examiner might erroneously
conclude that the individual is much more able along verbal than along
spatial lines, when the reverse may actually be the case.


Much of the confusion in the interpretation of scores from different tests
could probably be avoided if the normative populations were defined in more
specific terms. It is very difficult to obtain truly representative samples of
broadly defined populations. Consequently, the samples obtained by different
test constructors often tend to be unrepresentative and biased in different
ways, and the resulting norms are not comparable. As already suggested, a
more practicable and effective procedure is to standardize the test on a more
narrowly defined population, chosen to suit the specific purposes of the test.
The limitations of such a population should then be clearly stated in
reporting the norms. For example, a test may be standardized on "American-born
children in urban, Midwestern public schools," or upon "employed clerical
workers in large business organizations," or upon "first-year engineering
students." If the test proves its worth in actual practice, it may be deemed
advantageous to expand the boundaries of the normative population by testing
additional samples. The description of the norms would then be accordingly
generalized.


For many purposes, however, highly specific norms are desirable. Thus,
even when representative norms are available for a broadly defined
population, it is often helpful to have separately reported subgroup norms. This is
true whenever recognizable subgroups yield appreciably different scores on a
particular test. The subgroups may be formed with respect to age, grade,
type of curriculum, sex, geographical region, urban or rural environment,
socioeconomic level, and many other factors. The use to be made of the test
determines the type of differentiation that is most relevant, as well as whether
general or specific norms are more appropriate (3, 26).

That subgroup norms do reveal marked discrepancies in many tests has
been repeatedly demonstrated. On the Differential Aptitude Tests, for
example, sizable sex differences in favor of boys were found in the Space Relations
and Mechanical Reasoning Tests, while the girls clearly excelled in the
Clerical, Spelling, and Sentences Tests (29). If a boy obtains a raw score of 40
on the Mechanical Reasoning Test, he would fall at approximately the 75th
percentile in the combined distribution for both sexes. He might thus appear
to show promise for a curriculum or vocation requiring mechanical
understanding. In such a curriculum or occupation, however, he would compete
almost entirely with other males. When evaluated in terms of the male
distribution only, his raw score of 40 places him only at the 50th percentile. He
would therefore be just average in reference to the type of individuals with
whom he would have to compete. Similarly, a girl receiving a raw score of 53
on the Clerical Test would fall at the 40th percentile in the combined
distribution, but at the 30th percentile in the girls' distribution. The latter
percentile is of more practical significance, since the competition encountered in
clerical training and in clerical jobs would be predominantly female (29).
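The effect of the choice of norm group on a percentile rank can be illustrated with a small computation. The scores below are invented solely for illustration and are not taken from the Differential Aptitude Tests norms; the function name and the midpoint convention for ties are likewise assumptions.

```python
def percentile_rank(score, norm_group):
    """Percentage of the norm group falling below the score (ties count half)."""
    below = sum(1 for x in norm_group if x < score)
    tied = sum(1 for x in norm_group if x == score)
    return 100 * (below + 0.5 * tied) / len(norm_group)

# Hypothetical raw scores on a mechanical test, invented for this example.
boys = [28, 32, 35, 38, 40, 42, 45, 48, 52, 55]
girls = [15, 18, 22, 25, 28, 30, 33, 36, 39, 44]
combined = boys + girls

score = 40
print(percentile_rank(score, combined))  # 67.5
print(percentile_rank(score, boys))      # 45.0
```

The same raw score stands well above the middle of the combined group but slightly below the middle of the boys' group, which is the pattern described in the text.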

Another illustration of the practical implications of subgroup norms is
provided by a comparison of public and private school norm . . . the
Educational Records Bureau in re . . . and achievement tests (26). . . . large
part for these . . . however, the evaluation . . . which set of norms . . . a
private school, for example . . . a higher norm.

. . . often developed by the test users themselves within a particular
setting. The groups employed in deriving such norms are even more narrowly
defined than the subgroups considered above. Thus an employer may accumulate
norms on applicants for a given type of job within his company. Or a college
admissions office may develop norms on its own student population. Similarly,
in selecting pilots or bombardiers, the Air Force utilized norms on applicants
for pilot and bombardier training. These norms obviously permit a more
accurate prediction of the individual's performance than could be made with
norms derived from "men in general."

In summary, normative populations should be clearly defined. The
characteristics of such populations should be taken into consideration in
interpreting test scores. For many purposes, moreover, specific norms, based on
more narrowly defined normative populations, are more useful than general norms.
CHAPTER 9


Test Reliability


The reliability of a test refers to the consistency of scores obtained by the
same individuals on different occasions or with different sets of equivalent
items. This concept of reliability underlies the error of measurement of a
single score, whereby we can predict the range of fluctuation likely to occur
in a single individual's score as a result of irrelevant, chance factors.

Test reliability should not be confused with the reliability of statistical
measures. When, for example, we speak of the reliability of means, standard
deviations, or correlations, or when we inquire whether a difference between
two means is statistically significant, we refer primarily to sampling error. In
other words, we wish to know the consistency of these statistical measures
when redetermined on a different sample of the same population. Such a
question is obviously basic to the interpretation of any experimental results,
since the conclusions of scientific investigations are rarely, if ever, restricted
to the particular sample studied.

In common with other psychological investigations, analyses of test scores
frequently involve questions of sampling error. Such an error, however,
should be differentiated from the error of measurement, with which the
present chapter is concerned. Sampling error pertains to the consistency of
results obtained when observations are repeated on different individuals; error
of measurement, to the consistency of results obtained when observations are
repeated on the same individuals.


TYPES OF TEST RELIABILITY 


The concept of test reliability itself has been used to cover not one but
several aspects of score consistency. During the past quarter-century,
attention has been called repeatedly to the ambiguity of the blanket term, "test
reliability," and several more specific designations have been proposed from
time to time. The Technical Recommendations for Psychological Tests and
Diagnostic Techniques (1), prepared by the American Psychological
Association, have helped to systematize terminology in this area. It should be
noted that no one type or measure of test reliability is universally preferable.
The choice depends upon the use to which the test scores are to be put.

In its broadest sense, test reliability indicates the extent to which individual
differences in test scores are attributable to chance errors of measurement,
and the extent to which they are attributable to true differences in the
characteristic under consideration. To state it in more technical terms, every
measure of test reliability denotes what proportion of the total variance of
test scores is "error variance." The crux of the matter, however, lies in the
definition of error variance. Factors that might be considered error variance
for one purpose would be classified under true variance for another. For
example, if we are interested in measuring fluctuations of mood, then the
day-by-day changes in scores on a test of cheerfulness-depression would be
relevant to the purpose of the test and would hence be part of the true variance
of the scores. If, on the other hand, the test is designed to measure more
permanent personality characteristics, the same daily fluctuations would fall
under the heading of error variance.

Essentially, any condition that is irrelevant to the purpose of the test
represents error variance. Thus when the examiner tries to maintain uniform
testing conditions by controlling the testing environment, instructions, time
limits, rapport, and other similar factors, he is reducing error variance and
making the test scores more reliable. Despite optimum testing conditions,
however, no test is a perfectly reliable instrument. Hence every test should be
accompanied by a statement of its reliability. Such a measure of reliability
characterizes the test when administered under standard conditions and given
to subjects similar to those constituting the normative sample. The
characteristics of such a sample should therefore be specified, together with the
type of reliability that was measured.
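The division of total variance into true variance and error variance can be illustrated with a small simulation. The sketch rests on an assumption not spelled out in the text, namely that each obtained score is the sum of a stable true score and an independent random error. The function names, the sample size, and the SDs are hypothetical values chosen for illustration. Under that assumption, the correlation between two administrations of the test should approach the ratio of true variance to total variance.

```python
import random

def pearson_r(xs, ys):
    """Product-moment correlation between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate_retest(n_persons=20000, true_sd=10.0, error_sd=5.0, seed=0):
    """Correlate two simulated sessions that share true scores but not errors."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50, true_sd) for _ in range(n_persons)]
    first = [t + rng.gauss(0, error_sd) for t in true_scores]
    second = [t + rng.gauss(0, error_sd) for t in true_scores]
    return pearson_r(first, second)

# True variance divided by total variance: 100 / (100 + 25) = 0.80.
expected = 10.0 ** 2 / (10.0 ** 2 + 5.0 ** 2)
observed = simulate_retest()
print(f"expected {expected:.2f}, simulated {observed:.2f}")
```

Raising `error_sd` in the simulation, which corresponds to less uniform testing conditions, lowers the correlation between the two sessions.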


There could, of course, be as many varieties of test reliability as there are
conditions affecting test scores, since any such conditions might be irrelevant
for a certain purpose and would thus be classified as error variance. The . . . ,
however, are relatively few. . . . underlying the common measures of
test reliability will be considered below. In a later section of the chapter, the
relation between different concepts of reliability and specific techniques
currently employed to measure test reliability will be examined.
Temporal Stability. An obvious source of error vari 


ance for most test- 
ing purposes is to be found in the random fluctuations of 


performance occur- 
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ring from one test session to another. These variations may result in part from 
uncontrolled testing conditions, such as extreme changes in weather, sudden 
noises and other distractions, or a broken pencil point. To some extent, how- 
ever, they arise from changes in the condition of the subject himself, as illus- 
trated by illness, fatigue, emotional strain, worry, recent experiences of a 
the like. Temporal stability indicates the 
affected by the random daily fluctuations 
It is obvious 


pleasant or unpleasant nature, and 
degree to which scores on a test are 
in the condition of the subject or of the testing environment. 
a test depends in part upon the length of in- 


that the temporal stability of 
d. Illustrations could readily be cited of 


terval over which stability is measure 
tests showing high reliability over periods of a few days or weeks, but whose
scores reveal an almost complete lack of correspondence when the interval is
extended to as long as ten or fifteen years. Many preschool intelligence tests,
for example, yield moderately stable measures within the preschool period,
but are virtually useless as predictors of late childhood or adult IQ's.

In actual practice, however, a simple distinction can usually be made.
Short-range, random fluctuations that occur during intervals ranging from a
few hours to a few months are generally included under the error variance of
the test score. Thus in checking this type of test reliability, an effort is
made to keep the interval short. In testing young children, the period should
be even shorter than for older subjects, since at early ages progressive
developmental changes are discernible over a period of a month or even less.
For any type of subject, the interval between retests should rarely exceed
six months.

Any additional changes in the relative test performance of individuals that
occur over longer periods of time are apt to be cumulative and progressive
rather than entirely random. Moreover, they are likely to characterize a
broader area of behavior than that covered by the test performance itself.
Thus an individual's general level of scholastic aptitude, mechanical
comprehension, or artistic judgment may have altered appreciably over a
ten-year period, owing to unusual intervening experiences. The individual's
status may have either risen or dropped appreciably in relation to others of
his own age, because of circumstances peculiar to his own home, school, or
community environment, or for other reasons such as illness or emotional
disturbance.

The extent to which such factors can affect an individual's psychological
development provides an important problem for investigation. This question,
however, should not be confused with the stability of a particular test. Thus
when we measure the reliability of the Stanford-Binet or the Minnesota
Preschool Test, we would not ordinarily check the stability of the scores
over a period of ten years, or even one year, but over a few weeks. To be
sure, long-range retests have been conducted with such tests, but the results
are generally discussed in terms of the "constancy of the IQ" or the
predictability of adult intelligence from childhood performance, rather than
in terms of the reliability of a particular test. The concept of reliability
is generally restricted to short-range, random changes that characterize the
test performance itself rather than the broader area of behavior.

It should also be noted that different behavior functions may themselves vary
in the extent of daily fluctuation they exhibit. For example, steadiness of
delicate finger movements is undoubtedly more susceptible to slight changes
in the subject's condition than is verbal comprehension. If we wish to obtain
an over-all estimate of the individual's habitual finger steadiness, we would
probably require repeated tests on several days, while a single test session
would suffice for verbal comprehension. Again we must fall back on an
analysis of the purposes of the test and on a thorough understanding of the
behavior the test is designed to predict.

Item Sampling. Everyone has probably had the experience of taking a
course examination in which he felt he had a "lucky break" because many of
the items covered the very topics he happened to have studied most
carefully. On another occasion, he may have had the opposite experience,
finding an unusually large number of items on areas he had failed to review.
This familiar situation illustrates a second source of error variance in test
scores. To what extent do scores on this test depend upon factors specific to
the particular selection of items? If a different investigator, working
independently, were to prepare another test in accordance with the same
specifications, how much would an individual's score differ on the two tests?

Let us suppose that a 40-item vocabulary test has been constructed as a
measure of general word comprehension. Now suppose that a second list of
40 different words is assembled for the same purpose. [...] subjects, the
relative difficulty of the two lists will vary somewhat from person to
person. Thus the first list might contain a larger number of words unfamiliar
to individual A than does the second list. The second list, in turn, might
contain a disproportionately large number of words unfamiliar to individual
B. [...] The relative standing of these two subjects will therefore be
reversed on the two lists, owing to chance differences in the selection of
items.

Homogeneity of Items. Test homogeneity refers essentially to consistency
of performance on all items within a test. For example, if one test includes
only multiplication items, while another comprises addition, subtraction,
multiplication, and division items, the former test will probably show more
interitem consistency than the latter. In the latter, more heterogeneous
test, one subject may perform better in subtraction than in any of the other
arithmetic operations; another subject may score relatively well on the
division items, but more poorly in addition, subtraction, and multiplication;
and so on. A more extreme example would be represented by a test consisting
of 40 vocabulary items, in contrast to one containing 10 vocabulary, 10
spatial relations, 10 arithmetic reasoning, and 10 perceptual speed items. In
the latter test, there might be little or no relationship between a subject's
performance on the different types of items.

It is apparent that test scores will be less ambiguous when derived from
relatively homogeneous tests. Suppose that in the highly heterogeneous,
40-item test cited above, individuals A and B both obtain a score of 20. Can
we conclude that the performances of the two subjects on this test were
equal? Not at all. Subject A might have correctly completed 10 vocabulary
items, 10 perceptual speed items, and none of the arithmetic reasoning and
spatial relations items. In contrast, Subject B could have received a score
of 20 by the successful completion of 5 perceptual speed, 5 spatial
relations, and 10 arithmetic reasoning items.

Many other combinations could obviously produce the same total score of 
20. Such a score would have a very different meaning when obtained through 
such dissimilar combinations of items. In the relatively homogeneous vocabu- 
lary test, on the other hand, a score of 20 would probably mean that the sub- 
ject had succeeded with approximately the first 20 words, if the items were 
arranged in ascending order of difficulty. He might have failed two or three 
easier words and correctly responded to two or three more difficult items be- 
yond the 20th, but such individual variations are slight in comparison with 
those found in a more heterogeneous test. 
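
The ambiguity of a total score on a heterogeneous test can be made concrete
with a small sketch. The two score profiles follow the example of subjects A
and B given above; the dictionary layout and the function names are
illustrative assumptions, not part of any published scoring procedure.

```python
# Subtest scores (items correct out of 10) for the two hypothetical subjects
# described in the text. Each profile adds up to the same total of 20.
subject_a = {"vocabulary": 10, "perceptual_speed": 10,
             "arithmetic_reasoning": 0, "spatial_relations": 0}
subject_b = {"vocabulary": 0, "perceptual_speed": 5,
             "arithmetic_reasoning": 10, "spatial_relations": 5}


def total_score(profile):
    """Return the total number of items answered correctly."""
    return sum(profile.values())


def profile_difference(p, q):
    """Return the subtests on which two profiles differ, with both scores."""
    return {name: (p[name], q[name]) for name in p if p[name] != q[name]}


if __name__ == "__main__":
    print(total_score(subject_a), total_score(subject_b))
    print(profile_difference(subject_a, subject_b))
```

The totals are identical, yet the two subjects differ on every one of the four
subtests, which is why the total alone cannot be interpreted unambiguously.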

A highly relevant question in this connection is whether the criterion that
the test is trying to predict is itself relatively homogeneous or
heterogeneous. Although homogeneous tests are to be preferred because their
scores permit fairly unambiguous interpretation, a single homogeneous test is
obviously not an adequate predictor of a highly heterogeneous criterion.
Moreover, in the prediction of a heterogeneous criterion, the heterogeneity
of test items would not necessarily represent error variance. Traditional
intelligence tests provide a good example of heterogeneous tests designed to
predict a heterogeneous criterion. In such a case, however, it may be
desirable to construct several relatively homogeneous tests, each measuring a
different phase of the heterogeneous criterion. Thus unambiguous
interpretation of test scores could be combined with adequate criterion
coverage.

How does homogeneity differ from adequacy of item sampling, discussed
in the preceding section? An extreme example will serve to highlight the
difference. Suppose every item in a certain test measures a different and
unrelated function. It would be entirely possible to construct another,
parallel form of such a test containing the same types and distribution of
items as the first form. The scores on the two forms could theoretically
agree closely, thus indicating high test reliability in terms of item
sampling. The homogeneity of this test, however, would be close to zero,
since the consistency of performance from item to item within either form
would be no better than chance.

Whether homogeneity should be classified under reliability is a debatable
point. Many psychometricians would agree that homogeneity can be more
properly regarded as a separate property of tests, distinct from the
traditional concepts of either reliability or validity. It is included in the
present discussion, however, since it enters into certain measures of
reliability to be considered in a later section of the chapter. In any event,
the concept of homogeneity should be clearly distinguished from the other
forms of reliability already discussed.

Examiner and Scorer Reliability. It should now be apparent that the different
concepts of test reliability vary in the factors they subsume under "error
variance." In one case, error variance covers temporal fluctuations; in
another, it refers to differences between sets of parallel items; and in
still another, it includes any interitem inconsistency. On the other hand,
the factors excluded from measures of error variance are broadly of two
types: (a) those factors whose variance should remain in the scores, since
they are part of the true differences under consideration; and (b) those
irrelevant factors that can be experimentally controlled. For example, it is
not customary to report the error of measurement resulting when a test is
administered under distracting conditions or with a longer or shorter time
limit than that specified in the manual. Timing errors and serious
distractions can be empirically eliminated from the testing situation. Hence
it is not necessary to report special reliability coefficients corresponding
to "distraction variance" or "timing variance." Similarly, most tests provide
such highly standardized procedures for administration and scoring that
"examiner reliability" and "scorer reliability" can be assumed to be
sufficiently high for practical purposes. There is thus no special need for
measuring such types of reliability. This is particularly true of group tests
designed for mass testing and machine scoring. In such tests we need only to
make certain that the prescribed procedures are followed carefully. The
problem is thus one of empirical control of conditions.

In certain individual tests, however, the role of the examiner is far more 
complex. As illustrations may be cited the Stanford-Binet and most preschool 
tests. The testing procedure in such cases is not so rigidly standardized. Much 
depends upon the examiner’s success in establishing rapport and arousing 
adequate motivation. Often the subject's performance needs to be evaluated 
by the examiner during the process of test administration, since such per- 
formance determines how the examiner is to proceed with the testing. Under 
such conditions, it is likely that even properly qualified examiners may some- 
times obtain different results from the same subjects. These variations in score 
would constitute error variance attributable to individual differences or
idiosyncrasies among examiners.

An illustration of “examiner variance" is provided by an analysis of Stan- 
ford-Binet IQ's obtained by different examiners in the Harvard Growth 
Study (5). Differences as large as 13 points were found between the mean 
IQ's reported by two examiners for the same group of subjects. For individual 
subjects, differences of as much as 30 or 40 points were noted. In tests in
which examiner idiosyncrasy may play an appreciable part, it appears
desirable to obtain some measure of the "examiner reliability" of the test,
especially when results by several examiners are to be combined.

Similarly, certain types of tests present a problem of "scorer reliability."
In an investigation of the Goodenough Draw-a-Man Test of intelligence, for
example, the drawings by 386 children were scored independently by three
trained scorers (18). For about 25 per cent of the cases, interscorer
discrepancies were found that amounted to a year or more of mental age. Such
variations occurred despite the fact that a fairly objective system of credit
points has been developed for scoring this test, a system in which the three
scorers had been thoroughly trained.

With the widespread use of projective techniques as measures of
personality, the question of scorer reliability is receiving increasing
attention. Many current projective techniques leave much to the subjective
interpretation of the scorer, who is also usually the examiner. With such
well-known instruments as the Rorschach inkblot test, for example, the lack
of consistency sometimes found between the diagnoses reached from the same
records by different experienced scorers is truly astounding (cf., e.g., 11).
When specific response categories are compared, the degree of scorer
agreement is much higher but still falls far short of perfect reliability
(cf. 8, 20, 21). For such tests, there appears to be fully as much need for
an index of scorer reliability as for the more usual measures of reliability.


TECHNIQUES FOR MEASURING TEST RELIABILITY


The Correlation Coefficient. Since all types of test reliability are concerned
with the degree of consistency or agreement between two independently
derived sets of scores, they can all be expressed in terms of a correlation
coefficient, whose statistical symbol is r. A discussion of correlation
coefficients can be found in any elementary text on statistics (cf., e.g., 4,
9, 10, 12, 24, 25). For the present purpose, it will suffice to note some of
the principal characteristics of such coefficients. Essentially, a
correlation coefficient expresses the degree of correspondence, or
relationship, between two sets of scores. Thus if the top-scoring individual
in variable 1 also obtains the top score in variable 2, the second-best
individual in variable 1 is second best in variable 2, and so on down to the
poorest individual in the group, then there would be a perfect correlation
between variables 1 and 2. Such a correlation would have a value of +1.00.

A hypothetical illustration of a perfect positive correlation is shown in
Figure 18. In this figure will be found a scatter diagram, or bivariate
distribution. Each tally mark in this diagram indicates the score of one
individual in both variable 1 (horizontal axis) and variable 2 (vertical
axis). It will be noted that all of the 100 cases in the group are
distributed along the diagonal running from the lower left- to the upper
right-hand corner of the diagram. Such a distribution indicates a perfect
positive correlation (+1.00), since it shows that each individual occupies
the same relative position in both variables. The closer the bivariate
distribution of scores approaches this diagonal, the higher will be the
positive correlation.

Fig. 18. Bivariate Distribution for a Hypothetical Correlation of +1.00.

Figure 19 illustrates a perfect negative correlation (-1.00). In this case,
there is a complete reversal of scores from one variable to the other. The
best individual in variable 1 is the poorest in variable 2 and vice versa,
this reversal being consistently maintained throughout the distribution. It
will be noted that, in this scatter diagram, all individuals fall on the
diagonal extending from the upper left- to the lower right-hand corner. This
diagonal runs in the reverse direction from that in Figure 18.

A zero correlation indicates complete absence of relationship, such as
might occur by chance. If each individual's name were pulled at random out
of a hat to determine his position in variable 1, and if the process were
repeated for variable 2, a zero or near-zero correlation would result. Under
these conditions, it would be impossible to predict an individual's relative
standing in variable 2 from a knowledge of his score in variable 1. The
top-scoring subject in variable 1 might score high, low, or average in
variable 2. Some individuals might by chance score above average in both
variables, or below average in both; others might fall above average in one
variable and below in the other; still others might be above the average in
one and at the average in the second, and so forth. There would be no
regularity in the relationship from one individual to another.

The coefficients found in actual practice generally fall between these
extremes, having some value higher than zero but lower than 1.00.
Correlations between measures of abilities are nearly always positive,
although frequently low. When a negative correlation is obtained between two
such variables, it usually results from the way in which the scores are
expressed. For example, if time scores are correlated with amount scores, a
negative correlation will probably result. Thus if each subject's score on an
arithmetic computation test is recorded as the number of seconds required to
complete all items, while his score on an arithmetic reasoning test
represents the number of problems correctly solved, a negative correlation
can be expected. In such a case, the poorest (i.e., slowest) individual will
have the numerically highest score on the first test, while the best
individual will have the highest score on the second.

Correlation coefficients may be computed in various ways, depending upon
the nature of the data. The most common is the Pearson Product-Moment
Correlation Coefficient. Such a correlation coefficient takes into account
not only the individual's position in the group, but also the amount of his
deviation above or below the group mean. It will be recalled that when each
individual's standing is expressed in terms of standard scores, persons
falling above the average receive positive standard scores, while those below
the average receive negative scores. Thus an individual who is superior in
both variables to be correlated would have two positive standard scores; one
inferior in both would have two negative standard scores. If, now, we
multiply each individual's standard score in variable 1 by his standard score
in variable 2, all of these products will be positive, provided that each
individual falls on the same side of the mean on both variables. The Pearson
correlation coefficient is simply the mean of these products. It will have a
high positive value when corresponding standard scores are of equal sign and
of approximately equal amount in the two variables. When subjects above the
average in one variable are below the average in the other, the corresponding
cross-products will be negative. If the sum of the cross-products is
negative, the correlation will be negative. When some products are positive
and some negative, the correlation will be close to zero.
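
The definition of the Pearson coefficient as the mean of the products of
standard scores can be sketched directly in code. The function names and the
five-case data are hypothetical, and the standard deviation is assumed to be
computed with N (rather than N - 1) in the denominator, so that the mean
product of a variable with itself comes out at exactly 1.

```python
from math import sqrt


def standard_scores(values):
    """Convert raw scores to standard scores using the N-denominator SD."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def pearson_r(xs, ys):
    """Pearson r computed as the mean of the products of standard scores."""
    zx = standard_scores(xs)
    zy = standard_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


if __name__ == "__main__":
    scores = [1, 2, 3, 4, 5]                # hypothetical raw scores
    print(pearson_r(scores, scores))        # identical order in both variables
    print(pearson_r(scores, scores[::-1]))  # completely reversed order
```

With identical orderings every product of standard scores is positive or zero;
with reversed orderings every product is negative or zero, which is what
drives the sign of the coefficient.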


In actual practice, it is not necessary to convert each raw score to a
standard score before finding the cross-products, since this conversion can
be made once for all after the cross-products have been added. There are many
shortcuts for computing the Pearson correlation coefficient. One method is
demonstrated in Table 6, with hypothetical data for 10 cases. Next to each
child's name are his scores in an arithmetic test (X) and a reading test (Y).
The sums and means of the 10 scores are given under the respective columns.
The third column shows the deviation (x) of each arithmetic score from the
arithmetic mean; and the fourth column, the deviation (y) of each reading
score from the reading mean. These deviations are squared in the next two
columns, and the sums of the squares are used in computing the standard
deviations of the arithmetic and reading scores by the method described in
Chapter 4. Rather than dividing each x and y by its corresponding σ to find
standard scores, we perform this division only once at the end, as shown in
the correlation formula in Table 6. The cross-products in the last column
(xy) have been found by multiplying the corresponding deviations in the x
and y columns. To compute the correlation (r), the sum of these
cross-products is divided by the number of cases (N) and by the product of
the two standard deviations (σx σy).

TABLE 6. Computation of Pearson Product-Moment Correlation Coefficient
[Table not reproduced. Column headings: Arithmetic (X), Reading (Y).]

The correlation of .40 found in Table 6 indicates a moderate degree of
positive relationship between the arithmetic and reading scores. There is
some tendency for those children doing well in arithmetic also to perform
well on the reading test and vice versa, although the relation is not close.
If we are concerned only with the performance of these 10 children, we can
accept this correlation as an adequate description of the degree of relation
existing between the two variables in this group. In psychological research,
however, we are usually interested in generalizing beyond the particular
sample of individuals tested to a larger population which they represent. For
example, we might want to know whether arithmetic and reading ability are
correlated among American school children of the same age as those we tested.
Obviously the 10 cases actually examined would constitute a very inadequate
sample of such a population. Another comparable sample of the same size
might yield a much lower or a much higher correlation.
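
In symbols, the computation described for Table 6 can be sketched as follows.
The notation is an assumption: x and y denote deviations of the raw scores X
and Y from their means, and each standard deviation is taken with N in the
denominator, as the division by N in the description suggests.

```latex
r \;=\; \frac{1}{N}\sum \frac{x}{\sigma_x}\cdot\frac{y}{\sigma_y}
  \;=\; \frac{\sum xy}{N\,\sigma_x\,\sigma_y},
\qquad
\sigma_x = \sqrt{\frac{\sum x^{2}}{N}}, \quad
\sigma_y = \sqrt{\frac{\sum y^{2}}{N}}
```

The first form is the mean of the products of standard scores; the second
shows why the division by the standard deviations needs to be carried out only
once, after the cross-products have been summed.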

There are statistical procedures for estimating the probable fluctuation to
be expected from sample to sample in the size of correlations, means,
standard deviations, and any other group measures. The question usually asked
about correlations, however, is simply whether the correlation is
significantly greater than zero. In other words, if the correlation in the
population is zero, could a correlation as high as that obtained in our
sample have resulted from sampling error alone? When we say that a
correlation is "significant at the 1 per cent (.01) level," we mean that, if
the population correlation were zero, the chances are no greater than one out
of 100 that sampling error alone would yield a correlation as high as the one
obtained. Hence we conclude that the two variables are truly correlated.
Significance levels refer to the risk of error we are willing to take in
drawing conclusions from our data. If a correlation is said to be significant
at the .05 level, the risk of wrongly concluding that a relationship exists
is 5 out of 100. Most psychological research applies either the .01 or the
.05 levels, although other significance levels may be employed for special
reasons.

The correlation of .40 found in Table 6 fails to reach significance even at
the .05 level. As might have been anticipated, with only 10 cases it is
difficult to establish a general relationship conclusively. With this size of
sample, the smallest correlation significant at the .05 level is .63. Any
correlation below that value simply leaves unanswered the question of whether
the two variables are correlated in the population from which the sample was
drawn.
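
The threshold of .63 for 10 cases can be checked against a table of t, using
the standard relation between r and t for a sample correlation. This is a
minimal sketch: the critical value 2.306 (two-tailed, .05 level, 8 degrees of
freedom) is taken from a standard t table rather than from the text, and the
function name `critical_r` is illustrative.

```python
from math import sqrt


def critical_r(t_critical, n):
    """Smallest |r| that reaches significance for a sample of n pairs.

    Starts from t = r * sqrt(df) / sqrt(1 - r**2), with df = n - 2,
    and solves for r, giving r = t / sqrt(t**2 + df).
    """
    df = n - 2
    return t_critical / sqrt(t_critical ** 2 + df)


if __name__ == "__main__":
    # 2.306: two-tailed .05 critical value of t for 8 degrees of freedom,
    # read from a standard t table (an assumption from outside the text).
    print(round(critical_r(2.306, 10), 2))
```

Rounded to two places, the result agrees with the value of .63 quoted above,
and the obtained correlation of .40 falls well short of it.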

The minimum correlations significant at the .01 and .05 levels for groups
of different sizes can be found by consulting tables of the significance of
correlations in any statistics textbook. For interpretive purposes in this
book, however, only an understanding of the general concept is required.
Parenthetically, it might be added that significance levels can be
interpreted in a similar way when applied to other statistical measures. For
example, to say that the difference between two means is significant at the
.01 level indicates that we can conclude, with only one chance out of 100 of
being wrong, that a difference in the obtained direction would be found if we
tested the whole population from which our samples were drawn. For instance,
if in the sample tested the boys had obtained a significantly higher mean
than the girls on a mechanical comprehension test, we could conclude that the
boys would also excel in the total population.

Fig. 20. A Reliability Coefficient of .72. (Data from Anastasi and Drake, 2.)
[Scatter diagram; axis label: Score on Form 1, Word Fluency Test.]

Correlation coefficients have many uses in the analysis of psychological
data. The measurement of test reliability represents one application of such
coefficients. An example of a reliability coefficient, computed by the
Pearson Product-Moment method, is to be found in Figure 20. In this case, the
scores of 104 persons on two equivalent forms of a Word Fluency test¹ were
correlated. In one form, the subjects were given five minutes to write as
many words as they could that began with a given letter. The second form was
identical, except that a different letter was employed. The two letters were
chosen by the test authors as being approximately equal in difficulty for
this purpose.

¹ One of the subtests of the SRA Tests of Primary Mental Abilities for Ages
11 to 17. The data were obtained in an investigation by Anastasi and Drake
(2).

The correlation between the number of words written in the two forms of
this test was found to be .72. This correlation is high and significant at
the .01 level. With 104 cases, any correlation of .25 or higher is
significant at this level. Nevertheless, the obtained correlation is somewhat
lower than is desirable for reliability coefficients, which usually fall in
the .80's or .90's. An examination of the scatter diagram in Figure 20 shows
a typical bivariate distribution of scores corresponding to a high positive
correlation. It will be noted that the tallies cluster close to the diagonal
extending from the lower left- to the upper right-hand corner; the trend is
definitely in this direction, although there is a certain amount of scatter
of individual entries. In the following sections, the use of the correlation
coefficient in computing different measures of test reliability will be
considered.

Retest Reliability. The most obvious method for finding the reliability of
a test is by means of a retest, or repetition of the identical test on a
second occasion. The reliability coefficient (r₁₁) in this case is simply the
correlation between the scores obtained by the same subjects on the two
administrations of the test. Such a reliability coefficient is known as the
coefficient of stability (1) and corresponds to the first concept of
reliability discussed in the first part of this chapter.

Although apparently simple and straightforward, this technique presents
difficulties when applied to most psychological tests. Practice will probably
produce varying amounts of improvement in the retest scores of different
individuals. Moreover, if the interval between retests is fairly short, the
subjects may recall many of their former responses. In other words, the same
pattern of right and wrong responses is likely to recur through sheer memory.
Thus the scores on the two administrations of the test are not independently
obtained and the correlation between them will be spuriously high. The nature
of the test itself may also change with repetition. This is especially true
of problems involving reasoning or ingenuity. Once the subject has grasped
the principle involved in the problem, or once he has worked out a solution,
he can reproduce the correct response in the future without going through the
intervening steps. Only tests that are not appreciably affected by repetition
lend themselves to the retest technique. A number of sensory discrimination
and motor tests would fall into this category. For the large majority of
psychological tests, however, the retest technique is not suitable.

It might be added that a coefficient of scorer reliability can usually be
found by correlating the scores independently obtained by two scorers with
the same set of tests. Each subject would thus receive two scores, not from a
retest, but from a rescoring of a single test. The limitations of the retest
technique would not apply to such a reliability coefficient. In the
previously cited study of the Goodenough Draw-a-Man Test, for example,
interscorer correlations of .87, .90, and .92 were found among three scorers
(18).

Examiner reliability, on the other hand, presents more complications. By 
its very nature, it involves a retest. Such a retest by different examiners 
would obviously be subject to the same limitations as retests by the same ex- 
aminer discussed above. Moreover, the correlation between the two sets of 
scores thus obtained would reflect both temporal stability and examiner reli- 
ability. Through a different type of experimental design, however, it is pos- 
sible to isolate the error variance attributable to examiners. An application of 
analysis of variance to this problem can be found in a study on visual acuity 
tests, conducted by the Personnel Research Branch of The Adjutant General's
Office (23, pp. 130-131). Three of the 14 tests investigated proved to have 
significant variance attributable to examiners. 

Equivalent-Form Reliability. One way of avoiding the difficulties
encountered in retest reliability is through the use of equivalent forms of
the test. The subjects can then be tested with one form on the first occasion
and with another, comparable form on the second. The correlation between the
scores obtained on the two forms represents the reliability coefficient of
the test. It will be noted that such a reliability coefficient is a measure
of both temporal stability and consistency of response to different item
samples. Such a coefficient thus reflects two aspects of test reliability.
Since both aspects are important for most testing purposes, however,
parallel-form reliability provides a useful index for evaluating many tests.
If the two forms are administered in immediate succession or at essentially
the same time, the resulting correlation is designated as the coefficient of
equivalence (1). The error variance measured by such a coefficient reflects
variations in performance from one specific set of items to another, but not
from one occasion to another.

In the development of parallel forms, care should of course be exercised
to insure that they are truly parallel. Fundamentally, parallel forms of a
test should be independently constructed tests designed to meet the same
specifications. The tests should contain the same number of items, and such
items should be expressed in the same form and should cover the same type of
content. The range and level of difficulty of the items should likewise be
equal. Instructions, time limits, illustrative examples, format, and all
other aspects of the test need to be checked for comparability. Only when the
two forms are actually equivalent can the differences in score from one form
to the other be considered as error variance.

It should be added that the availability of equivalent test forms is
desirable for other reasons besides the determination of test reliability.
Alternate forms are useful in follow-up studies or in investigations of the
effects of some intervening experimental factor upon test performance. The
use of several alternate forms also provides a means of reducing the
possibility of coaching or cheating.

Although much more widely applicable than retest reliability, equivalent-form
reliability also has certain limitations. In the first place, if the behavior
functions under consideration are subject to a large practice effect, the use
of parallel forms will reduce but not eliminate such an effect. To be sure,
if all subjects were to show the same improvement with repetition, the
correlation between their scores would remain unaffected, since adding a
constant amount to each score does not alter the correlation coefficient. It
is much more likely, however, that individuals will differ in amount of
improvement, owing to extent of previous practice with similar material,
motivation in taking the test, and other factors. Under these conditions the
practice effect represents another source of variance that will tend to
reduce the correlation between the two test forms. If the practice effect is
small, reduction will be negligible.
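
The point about uniform versus unequal practice effects can be illustrated
with a short sketch. All scores and gains below are made up for the purpose;
`pearson_r` is a plain implementation of the product-moment formula, not a
library call.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation computed from deviation scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of six subjects on two forms of a test.
form_1 = [12, 15, 19, 22, 26, 30]
form_2 = [14, 15, 21, 21, 27, 29]

# Uniform practice effect: every subject gains 3 points on the second form.
uniform_gain = [score + 3 for score in form_2]

# Unequal practice effect: the gain differs from subject to subject.
gains = [8, 0, 1, 9, 0, 2]
unequal_gain = [score + g for score, g in zip(form_2, gains)]

if __name__ == "__main__":
    print(pearson_r(form_1, form_2))
    print(pearson_r(form_1, uniform_gain))  # unchanged by the constant gain
    print(pearson_r(form_1, unequal_gain))  # lower in this made-up example
```

Adding the same constant to every second-form score shifts the mean but leaves
every deviation, and hence the correlation, as it was. The unequal gains
change the deviations themselves, and in this example they pull the
correlation down.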

Another related question to be considered is the degree to which the nature
of the test will change with repetition. In certain types of ingenuity
problems, for example, any item involving the same principle can be readily
solved by most subjects [...] carry-over from the first form. Finally, it
should be added that equivalent forms are still unavailable for many tests,
because of the practical difficulties of constructing comparable forms. For
all of these reasons, some other technique for estimating test reliability is
required.

Split-half Reliability. From a single administration of one form of a test
it is possible to arrive at a measure of test reliability by various
split-half procedures. In such a way, two scores are obtained for each
individual by dividing the test into comparable halves. It is apparent that
split-half reliability provides a measure of equivalence, or adequacy of item
sampling. Temporal stability of the scores does not enter into such a
measure, since only one test session is involved.

The first problem is how to split the test in order to obtain the most nearly
comparable halves. Any test can be divided in many different ways. In most
tests, the first half and the second half would not be comparable, owing to
differences in nature and difficulty level of the items, as well as to the
cumulative effects of warming up, practice, fatigue, boredom, and any other
factors varying progressively from the beginning to the end of the test. A
precise and objective way of obtaining two comparable halves is to determine
the difficulty level of each item on one group of persons, by finding the
percentage who pass each item. Items are then assigned to the two halves on
the basis of equivalent difficulty and similarity of content. The reliability
is found by correlating the scores obtained on these two halves by a second
group of persons.

A procedure that is adequate for most purposes and much less laborious, 
however, is to find the scores on the odd and even items of the test. If the 
items were originally arranged in an approximate order of difficulty, such a 
division yields very nearly equivalent half-scores. One precaution to be ob- 
served in making such an odd-even split pertains to groups of items dealing 
with a single problem, such as questions referring to a particular mechanical 
diagram or to a given passage in a reading test. In such a case, a whole 
group of items should be assigned intact to one or the other half. Were the 
items in such a group to be placed in different halves of the test, the similarity 
of the half-scores would be spuriously inflated, since any single error in un- 
derstanding of the problem might affect items in both halves. 

Once the two half-scores have been obtained for each subject, they may
be correlated by the usual method. It should be noted, however, that such a
correlation actually gives the reliability of only a half-test. For example, if
the entire test consists of 100 items, the correlation is computed between two
sets of scores each of which is based on only 50 items. In both retest and
equivalent form reliability, on the other hand, each score is based on the full
number of items in the test.


Other things being equal, the longer a test, the more reliable it will be.
It is reasonable to expect that, with a larger sample of behavior, we can
arrive at a more adequate and stable measure. The effect that lengthening or
shortening a test will have upon its reliability coefficient can be estimated
by means of the Spearman-Brown formula, given below:

$$ r_{nn} = \frac{n r_{11}}{1 + (n - 1) r_{11}} $$

in which $r_{nn}$ is the estimated coefficient, $r_{11}$ the obtained
coefficient, and $n$ is the number of times the test is lengthened or
shortened. Thus if the number of test items is increased from 25 to 100, $n$ is
4; if it is decreased from 60 to 30, $n$ is 1/2. The Spearman-Brown formula is
widely used in determining test reliability by the split-half method, many test
manuals reporting reliability in this form. When applied to split-half
reliability, the formula always involves doubling the length of the test. Under
these conditions, it can be simplified as follows, writing $r_{hh}$ for the
correlation between the two half-scores and $r_{11}$ for the estimated
reliability of the whole test:

$$ r_{11} = \frac{2 r_{hh}}{1 + r_{hh}} $$
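
The odd-even procedure and the Spearman-Brown correction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the response matrix `RESPONSES` is invented for the example, and the helper names (`pearson`, `spearman_brown`, `odd_even_scores`) are assumptions, not part of any test manual.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r, n):
    """Estimated reliability when a test is lengthened n times (n may be < 1)."""
    return n * r / (1 + (n - 1) * r)


def odd_even_scores(item_scores):
    """Half-scores on the odd- and even-numbered items (items numbered from 1)."""
    return sum(item_scores[0::2]), sum(item_scores[1::2])


# Hypothetical data: 6 persons x 8 items, 1 = pass, 0 = fail,
# with items arranged in approximate order of difficulty.
RESPONSES = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0, 1, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]

halves = [odd_even_scores(person) for person in RESPONSES]
odd = [h[0] for h in halves]
even = [h[1] for h in halves]

half_test_r = pearson(odd, even)              # reliability of a half-test only
full_test_r = spearman_brown(half_test_r, 2)  # corrected to full length
print(round(half_test_r, 3), round(full_test_r, 3))
```

The same `spearman_brown` function covers the general case: `spearman_brown(r, 4)` corresponds to lengthening a test from 25 to 100 items, and `spearman_brown(r, 0.5)` to cutting it from 60 to 30.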

A weakness of the Spearman-Brown formula stems from its assumption
that the variabilities of the two half-scores are equal. Such an assumption may
not always be met, even when the half-scores appear to be comparable. A
better procedure, which avoids this assumption, is to use the following
formula (15):

$$ r_{11} = 2\left(1 - \frac{\sigma_a^2 + \sigma_b^2}{\sigma_t^2}\right) $$

in which $\sigma_a$ and $\sigma_b$ are the standard deviations of the
half-scores and $\sigma_t$ the standard deviation of total scores on the test.
It will be noted that this procedure ... a correlation coefficient.

... in their coefficients of interitem consistency, if they differ in the
degree of homogeneity of their items. In fact, the difference ... interitem
consistency coefficients ... an index of heterogeneity of the test items.

The most common procedure for finding interitem consistency is that
developed by Kuder and Richardson. As in the split-half methods, interitem
consistency is found from a single administration of a single test. Rather than
requiring two half-scores, however, such a technique is based upon an
examination of performance on each item. Of the various formulas developed by
Kuder and Richardson, the following is the most widely applicable:

$$ r_{11} = \left(\frac{n}{n - 1}\right) \frac{\sigma_t^2 - \sum pq}{\sigma_t^2} $$

in which $r_{11}$ is the reliability coefficient of the whole test, $n$ is the
number of items in the test, and $\sigma_t$ the standard deviation of total
scores on the test. The only new term in this formula, $\sum pq$, is found by
tabulating the proportion of persons who pass ($p$) and the proportion who do
not pass ($q$) each item. The product of $p$ and $q$ is computed for each item,
and these products are then added for all items, to give $\sum pq$. Since in
the process of test construction $p$ is often routinely recorded in order to
find the difficulty level of each item, such a method of determining
reliability involves little additional computation.
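
As a rough illustration of the Kuder-Richardson computation just described, the following Python sketch tabulates p and q for each item and applies the formula. The response matrix is invented, the function name `kr20` is an assumption, and the variance of total scores is taken as the population variance (dividing by the number of persons), which is also an assumption about how the standard deviation of total scores is meant.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson reliability from a persons x items matrix of 0/1 scores."""
    n_persons = len(responses)
    n_items = len(responses[0])

    # Variance of total scores (population form, an assumption of this sketch).
    totals = [sum(person) for person in responses]
    var_total = pvariance(totals)

    # Sum of p*q over items: p = proportion passing, q = proportion failing.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(person[j] for person in responses) / n_persons
        sum_pq += p * (1 - p)

    return (n_items / (n_items - 1)) * (var_total - sum_pq) / var_total


# Hypothetical data: 6 persons x 8 items, 1 = pass, 0 = fail.
RESPONSES = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0, 1, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]

print(round(kr20(RESPONSES), 3))
```

Because only item proportions and the total-score variance are needed, no half-scores and no correlation have to be computed.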

It can be shown that the Kuder-Richardson reliability coefficient given 
above is actually the mean of all split-half coefficients resulting from differ- 
ent splittings of a test (6). The ordinary split-half coefficient, on the other 
hand, is based on a planned split designed to yield equivalent sets of items. 
Hence unless the test items are highly homogeneous, the Kuder-Richardson 
coefficient will be lower than the split-half reliability. Both split-half and 
Kuder-Richardson coefficients, as well as any other reliability coefficient de- 
rived from a single administration of a single form, are designated as coeffi- 
cients of internal consistency (1). As explained above, however, the infor- 
mation provided by these two kinds of coefficients is not identical. For this 
reason, it is better to specify the method whereby a particular coefficient of 
internal consistency was obtained. 

Overview. The characteristics of the most common types of reliability
coefficients are summarized in Table 7. The first column identifies the
procedure followed to obtain the reliability coefficient. In the second column
is the conventional designation of each coefficient, as given in the APA
Technical Recommendations for Psychological Tests and Diagnostic Techniques
(1). The third column indicates the factors that are treated as error variance
by each technique.

TABLE 7. Types of Reliability Coefficients

Procedure                                        Conventional Designation                  Error Variance
Retest with same form on different occasion      Coefficient of stability                  Temporal fluctuation
Retest with parallel form on different occasion  Coefficient of stability and equivalence  Temporal fluctuation and item specificity
Retest with parallel form on same occasion       Coefficient of equivalence                Item specificity
Split-half (odd-even or other parallel splits)   Coefficient of internal consistency       Item specificity
Kuder-Richardson (and other measures of          Coefficient of internal consistency       Item specificity and heterogeneity
  interitem consistency)

The correlation between retests with the same form administered on different
occasions reflects the enduring or lasting characteristics of the individual's
responses. Temporary conditions, which are likely to vary from the first to
the second occasion, lower such a correlation and hence constitute error
variance in this procedure. If such a correlation is .85, for example, it means
that 15 per cent (100 - 85 = 15) of the variance of test scores is attributable
to temporal fluctuations. The use of parallel forms introduces item specificity
as a further source of error variance. The correlation between parallel forms
administered on different occasions depends upon those aspects of the
subject's performance that are both lasting (over the interval covered) and
generalizable beyond the specific items included in any one form. Thus if such
a correlation were .72, it would show that 72 per cent of the variance of test
scores was attributable to lasting and general response characteristics. Retest
correlations with parallel forms administered at the same time treat only item
specificity as error variance. If this correlation were .80, for instance, we
would conclude that 20 per cent (100 - 80 = 20) of the score variance
was the result of item specificity.

Split-half techniques provide the same type of information as parallel
forms administered on the same occasion. Splits such as that between odd and
even items virtually represent equivalent half-tests administered
simultaneously. As in the coefficient of equivalence, coefficients of internal
consistency based on split-half correlations treat only item specificity as
error variance. Such correlations show how generalizable the subject's
responses are, but not how stable in time. Finally, the Kuder-Richardson and
similar techniques include both item specificity and item heterogeneity under
error variance.


RELIABILITY OF SPEEDED TESTS 


Internal consistency coefficients based on odd-even, Kuder-Richardson, or
similar techniques are inapplicable to speeded tests. To the extent that
individual differences in test scores depend upon speed of performance,
reliability coefficients found by these methods will be spuriously high. An
extreme example will help to clarify this point. Let us suppose that a 50-item
test depends entirely on speed, so that individual differences in score are
based wholly upon number of items attempted, rather than upon errors. Then if
individual A obtains a score of 44, he will obviously have 22 correct odd
items and 22 correct even items. Similarly, individual B, with a score of 34,
will have odd and even scores of 17 and 17, respectively. Consequently,
except for accidental careless errors on a few items, the correlation between
odd and even scores would be perfect, or +1.00. Such a correlation, however,
is entirely spurious and provides no information about the reliability of
the test.

An examination of the procedures followed in finding both split-half and
Kuder-Richardson reliability will show that both are based upon the
consistency in number of errors made by the subject. If, now, individual
differences in test scores depend, not on errors, but on speed, the measure of
reliability must obviously be based on consistency in speed of work. To be
sure, most psychological tests are neither pure speed nor pure power tests, but
represent a combination of both. Under such conditions, the single-trial
reliability coefficient will fall below 1.00, but it will still be spuriously
high. As long as individual differences in test scores are appreciably affected
by speed, single-trial reliability coefficients cannot be properly interpreted.
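
The extreme case of a pure speed test can be reproduced numerically. The sketch below assumes, as in the example above, that every attempted item is answered correctly; the list of items attempted and the helper names are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def odd_even_in_pure_speed_test(items_attempted):
    """Odd and even half-scores when all attempted items are correct."""
    odd = (items_attempted + 1) // 2   # items 1, 3, 5, ...
    even = items_attempted // 2        # items 2, 4, 6, ...
    return odd, even


# Hypothetical number of items attempted by five persons on a 50-item test.
ATTEMPTED = [44, 34, 50, 28, 40]

halves = [odd_even_in_pure_speed_test(k) for k in ATTEMPTED]
odd_scores = [h[0] for h in halves]
even_scores = [h[1] for h in halves]

# The odd-even correlation is perfect, yet it says nothing about
# how consistently each person works at the same speed.
spurious_r = pearson(odd_scores, even_scores)
print(spurious_r)
```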

What alternative procedures are available to determine the reliability of
significantly speeded tests? If a simple repetition of the test is applicable,
such a procedure would be appropriate. Similarly, equivalent-form reliability
may be properly employed with speed tests. Split-half techniques may also be
used, provided that the split is made in terms of time rather than in terms of
items. In other words, the half-scores must be based on separately timed parts
of the test. One way of effecting such a split is to administer two equivalent
halves of the test with separate time limits. For example, the odd and even
items may be separately printed on different pages, and each set of items
given with one-half the time limit of the entire test. Such a procedure is
tantamount to administering two equivalent forms of the test in immediate
succession. Each form, however, is half as long as the test proper, while the
subjects' scores are normally based on the whole test. For this reason, either
the Spearman-Brown or some other appropriate formula should be used to
find the reliability of the whole test.

If it is not feasible to administer the two half-tests separately, an
alternative procedure is to divide the total time into quarters, and to find a
score for each of the four quarters. This can easily be done by having the
subjects mark the item on which they are working whenever the examiner gives a
prearranged signal. The number of items correctly completed within the first
and fourth quarters can then be combined to represent one half-score, while
those in the second and third quarters can be combined to yield the other
half-score. Such a combination of quarters tends to balance out the cumulative
effects of practice, fatigue, and other factors. This method is especially
satisfactory when the items are not steeply graded in difficulty level.
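
The quarter-time procedure can be sketched as follows. The quarter scores are invented, the helper names are assumptions, and the final Spearman-Brown step is included only because the half-scores again represent a test of half length.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def halves_from_quarters(q1, q2, q3, q4):
    """Combine quarters 1 + 4 and quarters 2 + 3 into two half-scores."""
    return q1 + q4, q2 + q3


# Hypothetical items correctly completed in each quarter of the time limit.
QUARTER_SCORES = [
    (6, 5, 5, 4),
    (4, 4, 3, 3),
    (8, 7, 7, 6),
    (5, 5, 4, 2),
    (7, 6, 5, 5),
]

halves = [halves_from_quarters(*q) for q in QUARTER_SCORES]
half_a = [h[0] for h in halves]
half_b = [h[1] for h in halves]

half_r = pearson(half_a, half_b)
whole_r = 2 * half_r / (1 + half_r)   # Spearman-Brown for doubled length
print(round(half_r, 3), round(whole_r, 3))
```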

When is a test appreciably speeded? Under what conditions must the special
precautions discussed in this section be observed? Obviously, the mere
employment of a time limit does not signify a speed test. If all subjects finish
within the given time limit, speed of work plays no part in determining the
scores. Percentage of subjects who fail to complete the test might be taken as
a crude index of speed versus power. Even when no one finishes the test,
however, the role of speed may be negligible. For example, if every subject
completes exactly 40 items of a 50-item test, individual differences with
regard to speed are entirely absent, although no one had time to attempt all
the items.

The essential question, of course, is: "To what extent are individual
differences in test scores attributable to speed?" In more technical terms, we
want to know what proportion of the total variance of test scores is speed
variance. This proportion can be estimated roughly by finding the variance of
number of items completed by different persons and dividing it by the variance
of total test scores ($\sigma_c^2 / \sigma_t^2$). In the example cited above,
in which every individual finishes 40 items, the numerator of this fraction
would be zero, since there are no individual differences in number of items
completed ($\sigma_c^2 = 0$). The entire index would thus equal zero in a pure
power test. On the other hand, if the total test variance ($\sigma_t^2$) is
attributable to individual differences in speed, the two variances will be
equal and the ratio will be 1.00. Several more refined procedures have been
developed for determining this proportion, but their detailed consideration
falls beyond the scope of this book (cf. 7, 13, 14, 16, 17).

An example of the effect of speed upon coefficients of internal consistency
is provided by data collected in an investigation of the SRA Tests of Primary
Mental Abilities for Ages 11 to 17 (2). In this study, the reliability of each
test was first determined by the usual odd-even procedure. These coefficients,
given in the first row of Table 8, are closely similar to those reported in the
test manual. Reliability coefficients were then computed by correlating scores
on separately timed halves ... the Verbal Meaning test is primarily a power
test, while the Reasoning test is somewhat more dependent upon speed. The Space
and Number tests proved to be highly speeded. It will be noted in Table 8 that,
when properly computed, the reliability of the Space test is .75, in contrast
to a spuriously high odd-even coefficient of .90. Similarly, the reliability of
the Reasoning test drops from .96 to .87, and that of the Number test drops
from .92 to .83. The reliability of the relatively unspeeded Verbal Meaning
test, on the other hand, shows a negligible difference when computed by the two
methods.

TABLE 8. Reliability Coefficients of Four of the SRA Tests of Primary Mental
Abilities for Ages 11 to 17
(Data from Anastasi and Drake, 2)

Reliability Coefficient Found by:   Verbal Meaning   Reasoning   Space   Number
Single-trial split-half method           .94            .96       .90     .92
Separately timed halves                  .90            .87       .75     .83
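
The rough speed index described earlier in this section (variance of number of items completed divided by variance of total scores) is easy to compute. The sketch below uses invented scores and a hypothetical function name, `speed_index`; it is not one of the more refined procedures cited in the text.

```python
from statistics import pvariance


def speed_index(items_completed, total_scores):
    """Rough proportion of total score variance attributable to speed."""
    return pvariance(items_completed) / pvariance(total_scores)


# Hypothetical total scores of five persons on a 50-item test.
TOTAL_SCORES = [31, 35, 28, 38, 33]

# Case 1: everyone completes exactly 40 items, so speed plays no part.
power_case = speed_index([40, 40, 40, 40, 40], TOTAL_SCORES)

# Case 2: a pure speed test, where the score is simply the number completed.
speed_case = speed_index(TOTAL_SCORES, TOTAL_SCORES)

print(power_case, speed_case)
```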


DEPENDENCE OF RELIABILITY COEFFICIENTS 
UPON THE SAMPLE TESTED 


An important factor influencing the size of a reliability coefficient is the
nature of the group on which reliability is measured. In the first place, any
correlation coefficient is affected by the range of individual differences in the
group. If every member of a group were alike in spelling ability, then the
correlation of spelling with any other ability would be zero in that group. It
would obviously be impossible, within such a group, to predict an individual's
standing in any other ability from a knowledge of his spelling score.

Another, less extreme, example is provided by the correlation between two
aptitude tests, such as a verbal comprehension and an arithmetic reasoning
test. If these tests were administered to a highly homogeneous sample, such
as a group of 300 college sophomores, the correlation between the two would
probably be close to zero. There is little relationship, within such a selected
sample of college students, between any individual's verbal ability and his
numerical reasoning ability. On the other hand, were the tests to be given to a
heterogeneous sample of 300 persons, ranging from institutionalized morons
to college graduates, a high correlation would undoubtedly be obtained
between the two tests. The morons would obtain poorer scores than the college
graduates on both tests, and similar relationships would hold for other
subgroups within this highly heterogeneous sample.

Examination of the hypothetical scatter diagram given in Figure 21 will
further illustrate the dependence of correlation coefficients upon the
variability, or extent of individual differences, within the group. This
scatter diagram shows a high positive correlation in the entire, heterogeneous
group, since the entries are closely clustered about the diagonal extending
from lower left- to upper right-hand corners. If, now, we consider only the
subgroup falling within the small rectangle in the upper right-hand portion of
the diagram, it is apparent that the correlation between the two variables is
close to zero. Within such a restricted range, small differences in score
assume much greater prominence in determining an individual's relative standing
in his group.
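
The effect of a restricted range can also be seen in a small simulation. Everything in the sketch below is assumed for illustration: the scores are artificial normal deviates generated with a fixed seed, the correlation in the full group is set near .80, and the "small rectangle" is taken as the cases scoring more than one standard deviation above the mean on both variables.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(21)   # fixed, arbitrary seed so the run is repeatable
TARGET_R = 0.8            # assumed correlation in the heterogeneous group

pairs = []
for _ in range(3000):
    x = rng.gauss(0, 1)
    y = TARGET_R * x + sqrt(1 - TARGET_R ** 2) * rng.gauss(0, 1)
    pairs.append((x, y))

# Correlation in the entire, heterogeneous group.
full_r = pearson([p[0] for p in pairs], [p[1] for p in pairs])

# Correlation within the upper right-hand "rectangle" only.
restricted = [(x, y) for x, y in pairs if x > 1.0 and y > 1.0]
restricted_r = pearson([p[0] for p in restricted], [p[1] for p in restricted])

print(len(restricted), round(full_r, 2), round(restricted_r, 2))
```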

Fig. 21. The Effect of Restricted Range upon a Correlation Coefficient.

Like all correlation coefficients, reliability coefficients depend upon the
variability of the sample within which they are found. Thus if the reliability
coefficient reported in a test manual was determined in a group ranging from
fourth-grade children to high school students, it cannot be assumed that the
reliability would be equally high within, let us say, an eighth-grade sample.
When a test is to be used to discriminate individual differences within a
more homogeneous sample than the standardization group, the reliability
coefficient should be redetermined on such a sample. Some test manuals make a
practice of reporting separate reliability coefficients for relatively
homogeneous subsamples within the standardization group (cf., e.g., 3).
Formulas for estimating the reliability coefficient to be expected when the
standard deviation of the group is increased or decreased are available in any
standard statistics textbook. It is preferable, however, to redetermine the
reliability coefficient on a group comparable to that on which the test is to
be used.

Not only does the reliability coefficient vary with the extent of individual
differences in the sample, but it may also vary between groups differing in
average ability level. These differences, moreover, cannot usually be
predicted or estimated by any statistical formula, but can be discovered only by
empirical tryout of the test on groups differing in age or ability level. Such
differences in the reliability of a single test may arise from the fact that a
slightly different combination of abilities is measured at different difficulty
levels of the test. Or it may result from the statistical properties of the type of
score employed, as in the Stanford-Binet IQ's (cf. 19, Ch. 6). Thus for
different ages and for different IQ levels, the reliability coefficient of the
Stanford-Binet varies from .83 to .98. In other tests, reliability may be
relatively low for the younger and less able groups, since their scores are
unduly influenced by guessing. Under such circumstances, the particular test
should not be employed at these levels.

It is apparent that every reliability coefficient should be accompanied by a
full description of the type of group on which it was determined. Special
attention should be given to the variability and the ability level of the sample.
The reported reliability coefficient is applicable only to samples similar to
that on which it was computed. A desirable and growing practice in test
construction is to fractionate the standardization sample into more homogeneous
subgroups, with regard to age, sex, grade level, occupation, and the like, and
to report separate reliability coefficients for each subgroup. Under these
conditions, the reliability coefficients are more likely to be applicable to the
samples with which the test is to be used in actual practice.


STANDARD ERROR OF MEASUREMENT 


The reliability of a test may be expressed in terms of the standard error of
measurement ($\sigma_{meas.}$), also called the standard error of a score. This
measure is particularly well suited to the interpretation of individual scores.
For many testing purposes, it is therefore more useful than the reliability
coefficient. The standard error of measurement can be easily computed from the
reliability coefficient of the test, by the following formula:

$$ \sigma_{meas.} = \sigma_t \sqrt{1 - r_{11}} $$

in which $\sigma_t$ is the standard deviation of the test scores and $r_{11}$
the reliability coefficient, both computed on the same group. For example, if
deviation IQ's on a particular intelligence test have a standard deviation of
15 and a reliability coefficient of .90, the $\sigma_{meas.}$ of an IQ on this
test is:

$$ 15\sqrt{1 - .90} = 15\sqrt{.10} = 15(.32) \approx 5 $$
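
The computation above can be checked with a few lines of Python. The function name is an assumption introduced for this sketch; the figures are those of the example (standard deviation of 15, reliability of .90).

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement from the test SD and reliability."""
    return sd * sqrt(1 - reliability)


# Deviation IQ's with SD = 15 and reliability coefficient = .90.
sem_iq = standard_error_of_measurement(15, 0.90)
print(round(sem_iq, 2))   # close to 5 IQ points when rounded
```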

To understand what the $\sigma_{meas.}$ tells us about a score, let us assume
that we have a series of 100 IQ's obtained with the same test by a single boy,
Jim. Because of the types of chance errors discussed in this chapter, these
scores will vary, falling into a normal distribution around Jim's "true" score.
The true score is the mean of the distribution of the 100 scores and the
$\sigma_{meas.}$ its standard deviation. This standard deviation may be
interpreted in terms of the normal curve frequencies that were discussed in
Chapter 4 (see Fig. 12). It will be recalled that between the mean and
$\pm 1\sigma$ there are approximately 68 per cent of the cases in a normal
curve. Thus we can conclude that the chances are roughly 2:1 (or 68:32) that
Jim's IQ on this test will fluctuate between $\pm 1\sigma_{meas.}$, or 5 points
on either side of his true IQ. If his true IQ is 110, we would expect him to
score between 105 and 115 about two-thirds (68 per cent) of the time.


If we want to be more certain of our prediction, we can choose higher odds
than 2:1. Reference to Figure 12 in Chapter 4 shows that $\pm 3\sigma$ covers
99.7 per cent of the cases. It can be ascertained from normal curve frequency
tables that a distance of $2.58\sigma$ on either side of the mean includes
exactly 99 per cent of the cases. Hence the chances are 99:1 that Jim's IQ will
fall within $2.58\sigma_{meas.}$, or (2.58)(5) = 13 points, on either side of
his true IQ. We can thus state at the .01 level (with only one chance of error
out of 100) that Jim's IQ on any single administration of the test will lie
between 97 and 123 (110 - 13 and 110 + 13). If Jim were given 100 equivalent
tests, his IQ would fall outside this band of values only once.

In actual practice, of course, we do not know the true IQ, but have only
the IQ obtained on a single test. If the reliability coefficient of the test is
quite high and the individual's score is not near the upper or lower extremes
of the range, the obtained score can be substituted for the true score in
computing the band within which the IQ will fall. A more precise method ...
coefficient by the following formula (cf. 22, p. 611):

$$ x_{true} = r_{11} \, x_{obt.} $$

in which $x_{obt.}$ is the deviation of the obtained score from the group mean
and $x_{true}$ the deviation of the true score from the same mean. If Jim's
obtained IQ was 110, his deviation from the mean of 100 is +10. With a
reliability coefficient of .90, his estimated true score is found as follows:

$$ x_{true} = (.90)(+10) = +9; \qquad 100 + 9 = 109 $$

Since the estimated true score falls 9 points above the mean IQ of 100, Jim's
estimated true score is 109. The band of values within which his IQ is likely
to fall at the .01 level is thus: 109 ± 13, or between 96 and 122.
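
The band computations in this section can be sketched in Python as follows. The function names are hypothetical, and the normal deviate is taken from the standard library's `NormalDist` rather than from printed tables, so it comes out as about 2.576 instead of the rounded 2.58 used in the text.

```python
from statistics import NormalDist


def normal_deviate(level):
    """Two-tailed normal deviate for a given level, e.g. .01 gives about 2.58."""
    return NormalDist().inv_cdf(1 - level / 2)


def estimated_true_score(obtained, group_mean, reliability):
    """Estimate the true score by shrinking the deviation toward the mean."""
    return group_mean + reliability * (obtained - group_mean)


def score_band(center, sem, level):
    """Band of values around `center` at the given level."""
    half_width = normal_deviate(level) * sem
    return center - half_width, center + half_width


# Jim's example: obtained IQ 110, group mean 100, reliability .90,
# standard error of measurement taken as 5 points.
true_iq = estimated_true_score(110, 100, 0.90)
low, high = score_band(true_iq, 5, 0.01)
print(true_iq, round(low), round(high))
```

Centering the band on the estimated true score rather than on the obtained score shifts it one point toward the group mean, as in the worked figures above.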


The standard error of measurement and the reliability coefficient are
obviously alternative ways of expressing test reliability. Unlike the reliability
coefficient, the error of measurement is independent of the variability of the
group on which it is computed. Expressed in terms of individual scores, it
remains unchanged when found in a homogeneous or a heterogeneous group.
On the other hand, being reported in score units, the error of measurement
may not be directly comparable from test to test. The usual problems of
comparability of units would thus arise when errors of measurement are
reported in terms of arithmetic problems, words in a vocabulary test, and the
like. Hence if we want to compare the reliability of different tests, the
reliability coefficient is the better measure. To interpret individual scores, the
standard error of measurement is more appropriate.


INTERPRETATION OF SCORE DIFFERENCES 


It is particularly important to consider test reliability and errors of
measurement when evaluating the differences between two scores. Thinking in
terms of the range within which each score may fluctuate serves as a check
against overemphasizing small differences between scores. Such caution is
desirable both when comparing test scores of different persons and when
comparing the scores of the same individual in different abilities. Similarly,
changes in scores following instruction or other experimental variables need
to be interpreted in the light of errors of measurement.

A frequent question the counselor must answer concerns the individual's
relative standing in different areas. Is Jane more able along verbal than
along numerical lines? Does Tom have more aptitude for mechanical than
for verbal activities? If Jane scored higher on the verbal than on the
numerical subtests of an aptitude battery and Tom scored higher on the
mechanical than on the verbal, how sure can we be that they would still do
so on a retest with another form of the battery? In other words, could the
score differences have resulted merely from the chance selection of specific
items in the particular verbal, numerical, and mechanical tests employed?

Because of the growing interest in the interpretation of score profiles, test
publishers have been developing report forms that permit the evaluation of
scores in terms of their errors of measurement. Outstanding examples are
provided by the Sequential Tests of Educational Progress (STEP) and the
School and College Ability Tests (SCAT), both of which will be discussed
in later chapters. Tables of norms and individual profiles for these tests are
constructed in terms of "percentile bands" based on the obtained score and
its standard error. Each band covers a distance of approximately one standard
error of measurement on either side of the obtained score. Illustrations
of profiles plotted with such percentile bands are given in Figure 37 (Ch. 9)
and Figure 99 (Ch. 16). In interpreting the profiles, the test user is advised
to attach no importance to differences between scores whose percentile
bands overlap.

Another way of handling the problem of reliability in intra-individual
comparisons is illustrated by the Differential Aptitude Tests (DAT), one of
whose score profiles was reproduced in the preceding chapter (Fig. 18). In
this approach, the standard error of the difference between each pair of
scores is computed. It can then be shown that a difference of about 10
standard score points between any two DAT subtests will be significant at
the .05 level.² The report form was so designed that such a difference
corresponded to a distance of 1 inch. Hence the test user can assume that
differences of 1 inch or more on the profile are significant at the .05 level
or better.

It is well to bear in mind that the standard error of the difference between 
two scores is larger than the error of measurement of either of the two 
scores. This follows from the fact that such a difference is affected by the 
chance errors present in both scores. Moreover, when we subtract one score 
from another any common variance in the two scores cancels out, leaving 
only specific and error variance. As a result, the error variance constitutes 
a larger proportion of total variance in the difference score than it does in 
either of the two original scores. The standard error of the difference be- 


tween two scores can be found from the standard errors of measurement of 
the two scores by the following formula: 


<a ert UNE 
aiff, = Wer ican; + F meas.) 


in which c,;j, is the standard error of the difference between the two scores. 
and emeas., and Oneas.y are the standard errors of me 
scores. 


We may illustrate the application of the above procedure to two of the 
DAT subtests, (1) Verbal Reasoning and (2) Mechanical Reasoning, whose 
split-half reliabilities are .88 and .85, respectively. DAT scores are reported 
as standard scores with a mean of 50 and an SD of 10. Hence standard 


errors of the two separate scores and of the differences between them are as 
follows: 


asurement of the separate 


$$\sigma_{\text{meas}_1} = 10\sqrt{1 - .88} = 3.46 \qquad \sigma_{\text{meas}_2} = 10\sqrt{1 - .85} = 3.87$$

$$\sigma_{\text{diff}} = \sqrt{(3.46)^{2} + (3.87)^{2}} = 5.20$$


2 Incorrectly reported as the .01 level in the test manual.
3 By substituting $SD\sqrt{1 - r_{11}}$ for $\sigma_{\text{meas}_1}$ and $SD\sqrt{1 - r_{22}}$ for $\sigma_{\text{meas}_2}$, we may rewrite
the formula directly in terms of reliability coefficients, as:

$$\sigma_{\text{diff}} = SD\sqrt{2 - r_{11} - r_{22}}$$

+ since their scores would normally
any comparisons between them.


It will be noted that the standard error of the difference is considerably
larger than the standard errors of the two separate scores. To determine
how large a score difference could be obtained by chance at the .05 level,
we multiply the standard error of the difference (5.20) by 1.96. The result
is approximately 10. Thus the difference between an individual's Verbal and
Mechanical scores must be 10 points or greater to be significant at the .05
level.
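
The arithmetic of the DAT illustration can be retraced in a few lines of Python. What follows is only a sketch under the assumptions stated in the text (an SD of 10, reliabilities of .88 and .85, and a multiplier of 1.96 for the .05 level); the function names are our own and do not come from any test manual.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def se_difference(sem_1, sem_2):
    """Standard error of the difference between two scores."""
    return math.sqrt(sem_1 ** 2 + sem_2 ** 2)


# Values taken from the illustration in the text.
SD = 10
R_VERBAL = 0.88
R_MECHANICAL = 0.85

sem_verbal = sem(SD, R_VERBAL)
sem_mech = sem(SD, R_MECHANICAL)
se_diff = se_difference(sem_verbal, sem_mech)

# Smallest difference that reaches the .05 level (two-tailed, z = 1.96).
cutoff = 1.96 * se_diff

print(f"{sem_verbal:.2f} {sem_mech:.2f} {se_diff:.2f} {cutoff:.1f}")
```

Working from unrounded intermediate values gives the same results as the hand computation to two decimal places, and the same standard error of the difference is obtained from the shortcut $SD\sqrt{2 - r_{11} - r_{22}}$.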
CHAPTER 6


Methods for Determining Validity 


As indicated in Chapter 2, the question of test validity concerns what
the test measures and how well it does so. In this connection, we should
guard against accepting the test name as an index of what the test measures.
Test names provide short, convenient labels for identification purposes.
Most test names are far too broad and vague to furnish meaningful clues to
the behavior area covered, although a trend toward the use of more specific
and operationally definable test names is discernible. The trait measured by a
given test can be defined only through an examination of the specific criteria
or other objective sources of information utilized in establishing its validity
(2). Moreover, the validity of a test cannot be reported in general terms.
No test can be said to have "high" or "low" validity in the abstract. Its
validity must be determined with reference to the particular use for which
it is being considered.

Fundamentally, all procedures for determining test validity are concerned
with the relationships between performance on the test and other independently
observable facts about the behavior characteristic under consideration.
The specific techniques employed for investigating these relationships are
numerous and have been described by various names. The APA Technical
Recommendations (1) classified these procedures under four categories,
designated as content, predictive, concurrent, and construct validity. Each of
these types of validation procedures will be considered in one of the following
sections, and the relations among them will be examined in a concluding
section. The utilization of validity data in making practical decisions will be
discussed in Chapter 7.


CONTENT VALIDITY 


Content validity involves essentially the systematic examination of the
test content to determine whether it covers a representative sample of the
behavior domain to be measured. Such a validation procedure is commonly
used in evaluating achievement tests. This type of test, it will be recalled, is 
designed to measure how well the individual has mastered a specific skill or 
course of study. It might thus appear that mere inspection of the content of 
the test should suffice to establish its validity for such a purpose. A test of 
multiplication, spelling, or American history would seem to be valid by defini- 
tion if it consists of multiplication, spelling, or American history items, re- 
spectively. 

The solution, however, is not as simple as it appears to be. One difficulty 
is presented by the problem of content sampling. The content area to be 
tested must be systematically analyzed to make certain that all major aspects 
are adequately covered by the test items, and in the correct proportions. For 
example, a test can easily become overloaded with those aspects of the field 
that lend themselves more readily to the preparation of objective items. The 
content area under consideration needs to be fully described in advance,
rather than being defined after the test has been prepared. A well-constructed
achievement test should cover the objectives of instruction, not just its sub- 
ject matter (9, 11). Content thus needs to be broadly defined to include 
major objectives, such as the application of principles and the interpretation 
of data, as well as factual knowledge. Moreover, content validity depends 
upon the relevance of the individual's test responses to the behavior area 
under consideration, rather than upon the apparent relevance of item con- 
tent (9, 14). Mere inspection of the test may fail to reveal the processes 
actually used by subjects in taking the test. 

It is also important to guard against any tendency to overgeneralize re- 
garding the content sampled by the test. For instance, a multiple-choice 
spelling test may measure the ability to recognize correctly and incorrectly 
spelled words. But it cannot be assumed that such a test also measures ability 
to spell correctly from dictation, frequency of misspellings in written composi- 
tions, and other aspects of “spelling ability” (cf. 7, 13). Still another diffi- 
culty arises from the possible inclusion of irrelevant factors in the test scores. 
For example, a test designed to measure the effects of instruction in such 
areas as mathematics or mechanics may be unduly influenced by the ability 
to understand verbal directions or by speed of performing simple, routine 
tasks. 

A number of empirical procedures can be followed to check on the con- 
tent validity of an achievement test (9, 10, 14). When available, parallel 
forms of the test can be administered before and after a relevant course of 
study to see whether there is an appreciable improvement in scores. Other
procedures include a study of the types of errors commonly made on the test
and an analysis of the work methods employed by subjects, possibly by 
giving the test individually to subjects with the instructions to “think aloud” 
in solving each problem. The contribution of speed can be checked by noting 
how many subjects fail to finish the test or by one of the more refined 
methods discussed in Chapter 5. To detect the possible irrelevant influence 
of ability to read instructions upon test performance, scores on the test can 
be correlated with scores on a reading comprehension test. On the other 
hand, if the test is designed to measure reading comprehension, giving the 
questions without the reading passage on which they are based will show 
how many could be answered simply from the subjects’ prior information 
or other irrelevant cues. 

Content validity, especially when bolstered by such empirical checks as
those illustrated above, provides an adequate technique for evaluating 
achievement tests. For aptitude and personality tests, however, content valid- 
ity is not sufficient and may, in fact, be misleading. Although considerations 
of appropriateness and effectiveness of content must obviously enter into the 
initial stages of constructing such tests, eventual validation of the test re- 
quires thorough empirical verification by the procedures to be described in 
the following sections. Aptitude and personality tests bear less intrinsic
resemblance to the behavior domain they are trying to sample than do
achievement tests. Consequently the content of aptitude and personality tests can do
little more than reveal the hypotheses that led the test constructor to choose
a certain type of content for measuring a specified trait. Such hypotheses
need to be empirically confirmed to establish the validity of the test. 

Unlike achievement tests, aptitude and personality tests are not based on a
specified course of instruction or uniform set of prior experiences from which
test content can be drawn. Hence in the latter tests individuals are likely to
vary more in the work methods or psychological processes employed in
responding to the same test items. The identical test might thus measure
different functions in different persons. Under these conditions, it would be
virtually impossible to determine the psychological functions measured by the
test from an inspection of its content. For example, college graduates might
solve a problem in verbal or mathematical terms, while a mechanic would
arrive at the same solution in terms of spatial visualization. Or a test
measuring arithmetic reasoning among high school freshmen might measure only
individual differences in speed of computation when given to college
students. A specific illustration of the dangers of relying upon content analysis
of aptitude tests is provided by a study conducted with a digit-symbol
substitution test (4). This test, generally regarded as a typical "code-learning"
test, was found to measure chiefly motor speed in a group of high school
students.

Content validity should not be confused with face validity. The latter is
not validity in the technical sense; it refers, not to what the test actually
measures, but to what it appears superficially to measure. Face validity
pertains to whether the test "looks valid" to the subjects who take it, the
administrative personnel who decide upon its use, and other technically
untrained observers. Fundamentally, the question of face validity concerns
rapport and public relations. Although common usage of the term validity
in this connection may make for confusion, face validity itself is a desirable
feature of tests. Certainly if test content appears irrelevant, inappropriate,
silly, or childish, the result will be poor cooperation, regardless of the actual 
validity of the test. When tests originally designed for children and developed 
within a classroom setting were first extended for adult use, they frequently 
met with resistance and criticism because of their lack of face validity. 
Especially in adult testing, it is not sufficient for a test to be objectively 
valid. It also needs face validity to function effectively in practical situations. 
Face validity can often be improved by merely reformulating test items 
in terms that appear relevant and plausible in the particular setting in which 
they will be used. For example, if a test of simple arithmetic reasoning is
constructed for use with machinists, the items should be worded in terms of
machine operations rather than in terms of "how many oranges can be
purchased for 36 cents" or other traditional schoolbook problems. Similarly,
an arithmetic test for naval personnel can be expressed in naval terminology,
without necessarily altering the functions measured. To be sure, face validity
should never be regarded as a substitute for objectively determined validity.
It cannot be assumed that improving the face validity of a test will improve 
its objective validity. Nor can it be assumed that when a test is modified so 
as to increase its face validity, its objective validity remains unaltered. The 
validity of the test in its final form will always need to be directly checked. 


PREDICTIVE VALIDITY 


Predictive validity indicates the effectiveness of a test in predicting some 
future outcome. For this purpose, test scores are checked against a direct 
measure of the subjects’ subsequent performance, technically known as the 
criterion. This type of validity information is most relevant for tests used in
the selection and classification of personnel. Hiring job applicants, selecting
students for admission to college or professional schools, and assigning
enlisted men to different military specialties represent examples of the sort of
decisions requiring a knowledge of the predictive validity of tests. Other ex- 
amples include the use of tests to screen out men likely to develop emotional 
disorders under military stress and the use of tests to identify psychiatric 
patients most likely to benefit from a particular therapy. 

In all these instances, the validation procedure is essentially similar. A
representative sample of the population under consideration is given the
test, but the scores are not used to make any decisions regarding this sample.
In fact, scores should not be accessible to anyone in a position to make
decisions that influence outcomes for these subjects. As explained in Chapter 2,
this precaution is required in order to avoid any possible criterion
contamination. Follow-up of the original sample after criterion data have become
available will show how closely predictions from the test scores agree with
observed outcomes.
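
One conventional way of expressing how closely test scores agree with later outcomes is a correlation between the two sets of measures. The sketch below computes a Pearson product-moment correlation for a small follow-up sample. The scores and criterion ratings are invented purely for illustration, and the function name is our own; nothing here is drawn from an actual validation study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between paired lists x and y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical data: test scores at the time of testing, and criterion
# ratings gathered later for the same eight persons (same order).
test_scores = [42, 55, 61, 48, 70, 66, 53, 59]
later_ratings = [2.1, 2.9, 3.4, 2.5, 3.8, 3.1, 2.6, 3.3]

validity = pearson_r(test_scores, later_ratings)
print(f"validity coefficient: {validity:.2f}")
```

A sample this small would be far too unstable for any practical conclusion; the point is only to show the form of the computation.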

Predictive validity against various criteria is commonly reported in test
manuals to aid the potential user in understanding what a test measures.
Although he may not be directly concerned with the prediction of any of the
specific criteria employed, by examining such criteria the test user is able to
build up a concept of the behavior domain sampled by the test. The criteria
most frequently cited in test manuals include general academic achievement,
performance in specialized training, and on-the-job performance.

General Academic Achievement. Probably the most common criterion
employed in validating intelligence tests is some index of academic
achievement. It is for this reason that such tests have often been more precisely
described as measures of scholastic aptitude. The specific indices used as
criterion measures include school grades, achievement test scores, promotion
and graduation records, special honors and awards, and teachers' or
instructors' ratings for "intelligence." In so far as such ratings given within an
academic setting are likely to be heavily colored by the individual's scholastic
performance, they may be properly classified with the criterion of academic
achievement.

The various indices of academic achievement have provided criterion data
at all educational levels, from the primary grades to college and graduate
school. Although employed principally in the validation of general
intelligence tests, they have also served as criteria for certain multiple aptitude
and personality tests. In the validation of any of these types of tests for use
in the selection of college students, for example, a common criterion is
freshman grade-point average. This measure is the average grade in all courses
taken during the freshman year, each grade being weighted by the number
of course points for which it was received.
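
The weighting just described can be made concrete with a short sketch. It assumes, purely for illustration, that grades are expressed on a numerical scale (here 4 = A, 3 = B, and so on) and that each course carries a known number of course points; both the scale and the sample record are hypothetical.

```python
def grade_point_average(courses):
    """Weighted mean grade.

    `courses` is a list of (grade, course_points) pairs; each grade is
    weighted by the number of course points for which it was received.
    """
    total_points = sum(points for _, points in courses)
    if total_points <= 0:
        raise ValueError("total course points must be positive")
    weighted_sum = sum(grade * points for grade, points in courses)
    return weighted_sum / total_points


# Hypothetical freshman record: (grade on an assumed 4-point scale, points).
freshman_courses = [(4.0, 3), (3.0, 4), (2.0, 3), (3.0, 2)]

gpa = grade_point_average(freshman_courses)
print(f"freshman grade-point average: {gpa:.2f}")
```

Because of the weighting, a grade earned in a four-point course counts twice as heavily as the same grade earned in a two-point course.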
Performance in Specialized Training. In the development of special aptitude
tests, a frequent type of criterion is based upon performance in a course
of specialized training. For example, mechanical aptitude tests may be 
validated against final achievement in shop courses. Various business school 
courses, such as stenography, typing, or bookkeeping, provide criteria for 
aptitude tests in these areas. Similarly, performance in music or art schools 
has been employed in validating music or art aptitude tests, respectively. 
Several professional aptitude tests have been validated in terms of achieve- 
ment in schools of law, medicine, dentistry, engineering, and other areas. In 
the case of custom-made tests, designed for use within a specific testing
program, training records are a frequent source of criterion data. An out- 
standing illustration is the validation of Air Force pilot selection tests against 
performance in pilot training. 

Among the specific indices of training performance employed for criterion
purposes may be mentioned achievement tests administered upon completion
of training, formally assigned grades, instructors' ratings, and successful
completion of training versus elimination from the program. Multiple aptitude
batteries have often been checked against grades in specific high school or
college courses, in order to determine their validity as differential predictors.
For example, scores on a verbal comprehension test may be compared with
grades in English courses, spatial visualization scores with geometry grades,
and so forth.

In connection with the use of training records as criterion measures, a useful
distinction is that between intermediate and ultimate criteria [...] practicing
physician, respectively. Obviously
it would require a long time for such criterion data to mature. It is doubtful,
moreover, whether a truly ultimate criterion is ever obtained in actual
practice. Finally, even were such an ultimate criterion available, it would
probably be subject to many uncontrolled factors that would render it relatively
useless. For example, it would be difficult to evaluate the relative degree of
success of physicians practicing different specialties and in different parts of
the country. For these reasons, such intermediate criteria as performance
records at some stage of training are frequently employed as criterion
measures.

On-the-Job Performance. In many ways, the most satisfactory type of 
criterion measure is that based upon follow-up records of actual job perform- 
ance. Such a criterion has been used to some extent in the validation of 
general intelligence as well as personality tests, and to a large extent in the
validation of special aptitude tests. It is a common criterion in the validation
of custom-made tests for specific jobs. The "jobs" in question may vary 
widely in both level and kind, including work in business, industry, the 
professions, the armed forces, and any other field. Most measures of job 
performance, although probably not representing ultimate criteria, at least 
provide good intermediate criteria for many testing purposes. In this respect 
they are to be preferred to training records. On the other hand, the measure- 
ment of job performance does not permit as much uniformity of conditions 
as is possible during training. Moreover, since it usually involves a longer 
follow-up, the criterion of job performance is likely to entail a loss in the 
number of available subjects. 

A wide variety of indices may be employed to measure degree of job
success. Among them may be mentioned quantity and quality of output,
accidents and loss through breakage, salary and commissions, job stability and
length of service, rate of advancement, and merit ratings by supervisors. The
criterion measure may be based on the observation of a limited sample of
the individual's job performance, such as a work sample, sales interview, or
pilot check flight. Or it may be derived from a cumulative record of output,
sales, merit ratings, or job history covering the total available period on the
job (cf. 17, pp. 132-159).


CONCURRENT VALIDITY 


The relation between test scores and indices of criterion status obtained at 
approximately the same time is known as concurrent validity. In a number
of instances, concurrent validity is found merely as a substitute for predictive
validity. It is frequently impracticable to extend validation procedures over
the time required for predictive validity or to obtain a suitable preselection
sample for testing purposes. As a compromise solution, therefore, tests are 
administered to a group on whom criterion data are already available. Thus 
the test scores of college students may be compared with their cumulative 
grade-point average at the time of testing, or those of employees compared 
with their current job success. 

A variant of the criterion of academic achievement frequently employed 
with out-of-school adults is the amount of education the individual completed. 
It is expected that in general the more intelligent individuals continue their 
education longer, while the less intelligent drop out of school earlier. The 
assumption underlying this criterion is that the educational ladder serves as
a progressively selective influence, eliminating those incapable of continuing
beyond each step. Although it is undoubtedly true that college graduates,
for example, represent a more highly selected group than elementary school
graduates, the relation between amount of education and scholastic aptitude
is far from perfect. Especially at the higher educational levels, economic,
social, motivational, and other non-intellectual factors may influence the 
continuation of the individual's education. Moreover, with such concurrent 
validation, it is difficult to disentangle cause-and-effect relations. To what 
extent are the obtained differences in intelligence test scores simply the result 
of the varying amount of education? And to what extent could the test have 
predicted individual differences in subsequent educational progress? These 
questions can be answered only when the test is administered before the 
criterion data have matured, as in predictive validation. 

Similarly, when current test scores are compared with the grades earned 
by college students or the job proficiency of employees, we cannot determine 
how far the test scores reflect what the students have learned in college or 
what the workers have learned on the job. Furthermore, when employees or 
college students are given tests "for research purposes only," their motiva- 
tion and test-taking attitudes may be quite unlike those of job applicants or 
students seeking college admission. It is apparent that to generalize from 
concurrent to predictive validity is a questionable procedure. 

For certain uses of psychological tests, on the other hand, concurrent 
validity is the most appropriate type and can be justified in its own right. 
The logical distinction between predictive and concurrent validity is based,
not on time, but on the objectives of testing. Concurrent validity is relevant
to tests employed for diagnosis of existing status, rather than prediction of
future outcomes. The difference can be illustrated by asking: "Is Smith
neurotic?" (concurrent validity) and "Is Smith likely to become neurotic?"
(predictive validity).

Since the criterion for concurrent validity is always available at the time
of testing, we might ask what function is served by the test in such situations.
Basically, such tests provide a simpler, quicker, or less expensive substitute
for the criterion data. For example, if the criterion consists of continuous
observation of a patient during a two-week hospitalization period, a test that
could sort out normals from neurotic and doubtful cases would appreciably
reduce the number of persons requiring such extensive observation.

Test manuals frequently contain data on concurrent validity, either as a
substitute for predictive validity or as evidence of the diagnostic power of
the test. Although especially suitable for personality tests, this type of
validity is also reported for many ability tests. Among the most common
criteria employed for concurrent validation are contrasted groups, ratings,
and other tests.

Contrasted Groups. Validation by the method of contrasted groups
generally involves a composite criterion that reflects the cumulative and
uncontrolled selective influences of everyday life. This criterion is ultimately based
upon survival within a particular group versus elimination therefrom. For
example, in the validation of an intelligence test, the scores obtained by
institutionalized mental defectives may be compared with those obtained by
school children of the same age. In this case, the multiplicity of factors
determining commitment to an institution for the feebleminded constitutes the
criterion. Similarly, the validity of a musical aptitude or a mechanical
aptitude test may be checked by comparing the scores obtained by students
enrolled in a music school or an engineering school, respectively, with the
scores of unselected high school or college students.
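
A comparison of this kind can be summarized in several ways. The sketch below uses one conventional summary, the difference between the two group means expressed in units of their pooled standard deviation; this particular index is our own choice for illustration and is not prescribed by the method itself. The scores are invented.

```python
import math
import statistics


def standardized_mean_difference(group_a, group_b):
    """(mean of A - mean of B) divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical scores on a musical aptitude test.
music_students = [62, 58, 65, 70, 61]
unselected_students = [50, 47, 55, 52, 46]

d = standardized_mean_difference(music_students, unselected_students)
print(f"standardized difference between groups: {d:.2f}")
```

A large positive value indicates that the selected group scores well above the unselected group, which is the pattern the method of contrasted groups looks for. With groups as small as these the figure would of course carry little weight.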

To be sure, contrasted groups can be selected on the basis of any criterion,
such as school grades, ratings, or job performance, by simply choosing the
extremes of the distribution of criterion measures. The contrasted groups
included in the present category, however, are distinct groups that have
gradually become differentiated through the operation of the multiple demands of
daily living. The criterion under consideration is thus more complex and less
clearly definable than those previously discussed.

The method of contrasted groups is used quite commonly in the validation
of personality tests. Thus in validating a test of social traits, the test
performance of salesmen or executives, on the one hand, may be compared with
that of clerks or engineers, on the other. The assumption underlying such a
procedure is that, with reference to many social traits, individuals who have
entered and remained in such occupations as selling or executive work will
as a group excel persons in such fields as clerical work or engineering.
Similarly, college students who have engaged in many extracurricular activities
may be compared with those who have participated in none during a
comparable period of college attendance. Occupational groups have frequently
been used in the development and validation of interest tests, such as the
Strong Interest Test, as well as in the preparation of attitude scales. Other
groups sometimes employed in the validation of attitude scales include
political, religious, geographical, or other special groups generally known to
represent distinctly different points of view on certain issues.

A number of personality tests concerned with the measurement of
emotional or social adjustment are validated on such groups as institutionalized
delinquents versus non-delinquents, or on neurotics versus normals. During
World War II, for example, comparisons were made between the scores
obtained on certain personality tests by the general selectee population and
the scores obtained by individuals discharged from service because of
neuropsychiatric disability. The criterion in such a case is inability to remain in
military service because of personality difficulties.

In the development of certain personality tests, psychiatric diagnosis is used 
both as a basis for the selection of items and as evidence of test validity. 
Psychiatric diagnosis may serve as a satisfactory criterion provided that it is 
based upon prolonged observation and detailed case history, rather than 
upon a cursory psychiatric interview or examination. In the latter case, there 
is no reason to expect the psychiatric diagnosis to be superior to the test 
score itself as an indication of the individual’s emotional condition. Such a 
psychiatric diagnosis could not be regarded as a criterion measure, but rather 
as an indicator or predictor whose own validity would have to be determined. 

Ratings. Mention has already been made, in connection with other criterion 
categories, of certain types of ratings by school teachers, instructors in spe- 
cialized courses, and job supervisors. To these can be added ratings by offi- 
cers in military situations, ratings of students by school counselors, and ratings 
by co-workers, classmates, fraternity brothers, sorority sisters, and other 
groups of associates. The ratings discussed in earlier sections represented 
merely a subsidiary technique for obtaining information regarding such cri- 
teria as academic achievement, performance in specialized training, or job 
success. In this section, however, we are concerned with the use of ratings as 
the very core of the criterion measure. Under these circumstances, the rat- 
ings themselves define the criterion. Moreover, such ratings are not restricted 
to the evaluation of specific achievement, but involve a personal judgment 
by an observer regarding any of the variety of traits that psychological tests 
attempt to measure. Thus the subjects in the validation sample might be 
rated on such characteristics as dominance, mechanical ingenuity,
originality, leadership, or honesty.

Ratings have been employed in the validation of almost every type of test.
They are particularly useful in providing criteria for personality tests, since
objective criteria are much more difficult to find in this area. Especially is
this true of distinctly social traits, in which ratings based upon personal
contact may constitute the most logically defensible criterion.

If ratings are obtained from trained raters under carefully controlled
conditions, they can provide a valuable source of criterion data. It is generally
desirable to secure independent ratings from more than one observer, in
order to rule out individual bias and idiosyncrasy of the rater. The accuracy
of ratings can be greatly increased by the use of well-constructed rating scales
with clearly defined, unambiguous units and with adequate safeguards against
common rating errors, such as the "halo effect" (the tendency on the part of
raters to be unduly influenced by a single favorable or unfavorable trait,
which thus colors their judgment of the individual's other traits). Finally, it
is essential that raters have "trait acquaintance" with the individual in the 
traits they are rating. In other words, it is not enough to have known the 
individual for a long time. The rater should have had the opportunity to 
observe the individual in situations in which the particular trait in question 
was manifested. Raters should not rate an individual on traits for which they 
lack adequate trait acquaintance. 

Ratings are also involved in criterion measures based upon clinical
evaluations. In such cases, of course, the clinician's judgment may be aided by
other supporting information, such as a detailed case history, test scores, and
other objective records. The final evaluation, however, depends upon judg- 
ment, as in all ratings. Clinical evaluations are being used increasingly in 
the validation of personality tests. An example is provided by a project con- 
ducted at the USAF School of Aviation Medicine (16). In this project, 
clinical evaluations by staff psychologists constituted an important part of 
the criterion employed in investigating the validity of a series of personality
screening tests for pilot cadets.

Correlations with Other Tests. Correlations between a new test and
previously available tests are frequently cited as evidence of validity. When the
new test is an abbreviated or simplified form of a currently available test,
the latter can properly be regarded as a criterion measure. Thus a
paper-and-pencil test might be validated against a more elaborate and
time-consuming performance test whose validity had previously been established. Or
a group test might be validated against an individual test. The
Stanford-Binet, for example, has repeatedly served as a criterion in validating group
tests. In such a case, the new test may be regarded at best as a crude
approximation of the earlier one. It should be noted that unless the new test
represents a simpler or shorter substitute for the earlier test, the use of the
latter as a criterion is indefensible.


CONSTRUCT VALIDITY 


The construct validity of a test is the extent to which the test may be
said to measure a "theoretical construct" or trait. Examples of such
constructs are intelligence, mechanical comprehension, verbal fluency, speed of
walking, neuroticism, and anxiety. Focusing on a broader, more enduring,
and more abstract kind of behavioral description than the previously
discussed types of validity, construct validation requires the gradual
accumulation of information from a variety of sources. Any data throwing light on the
nature of the trait under consideration and the conditions affecting its
development and manifestations are grist for this validity mill. As illustrations
of the specific techniques utilized may be mentioned age differentiation,
correlations with other tests, factor analysis, internal consistency, and effect of
experimental variables on test scores.

Age Differentiation. A major criterion employed in the validation of a 
number of intelligence tests is age. Such tests as the Stanford-Binet and most 
preschool tests are checked against chronological age to determine whether 
the scores show a progressive increase with advancing age. Since abilities 
are expected to increase with age during childhood, it is argued that the test 
scores should likewise show such an increase, if the test is valid. The very 
concept of an age scale of intelligence, as initiated by Binet, is based upon 
the assumption that "intelligence" increases with age, at least until maturity. 
The criterion of age differentiation, of course, is inapplicable to any
functions that do not exhibit clear-cut and consistent age changes. In the
area of personality measurement, for example, it has found limited use.
Moreover, it should be noted that age differentiation is essentially a
negative rather than a positive criterion. Thus if the test scores fail to
improve with age, such a finding probably indicates that the test is not a
valid measure of the abilities it was designed to sample. On the other hand,
to prove that a test measures something that increases with age does not
define the area covered by the test very precisely. A measure of height or
weight would also show regular age increments, although it would obviously
not be designated as an intelligence test.


A final point should be emphasized regarding the interpretation of the age
criterion. A psychological test validated against such a criterion measures
behavior characteristics that increase with age under the conditions existing
in the type of environment in which the test was standardized. Since
different cultures may stimulate and foster the development of dissimilar
behavior characteristics, it cannot be assumed that the criterion of age
differentiation is a universal one. Like all other criteria, it is
circumscribed by the particular cultural setting in which it is derived.

Correlations with Other Tests. Correlations between a new test and similar
earlier tests are sometimes cited as evidence that the new test measures
approximately the same general area of behavior as other tests designated by
the same name, such as “intelligence tests” or “mechanical aptitude tests.”
Unlike the correlations found in concurrent validity, these correlations
should be moderately high, but not too high. If the new test correlates too
highly with an already available test, without such added advantages as
brevity or ease of administration, then the new test represents needless
duplication.

Correlations with other tests are employed in still another way to
demonstrate that the new test is relatively free from the influence of
certain irrelevant factors. For example, a special aptitude test or a
personality test
should have a negligible correlation with tests of general intelligence or 
scholastic aptitude. Similarly, reading comprehension should not appreciably 
affect performance on such tests. Thus correlations with tests of general in- 
telligence, reading, or verbal comprehension are sometimes reported as in- 
direct or negative evidence of validity. In these cases, high correlations would 
make the test suspect. Low correlations, however, would not in themselves 
insure validity. It will be noted that this use of correlations with other tests 
is similar to one of the techniques described under content validity. 

Factor Analysis. Of particular relevance to construct validity is factor 
analysis, a statistical procedure for the identification of psychological traits. 
Essentially, factor analysis is a refined technique for analyzing the interrela- 
tionships of behavior data. For example, if 20 tests have been given to 300 
persons, the first step is to compute the correlations of each test with every 
other. An inspection of the resulting table of 190 correlations may itself
reveal certain clusters among the tests, suggesting the location of common
traits. Thus if such tests as vocabulary, analogies, opposites, and sentence 
completion have high correlations with each other and low correlations with 
all other tests, we could tentatively infer the presence of a verbal comprehen- 
sion factor. Since such an inspectional analysis of a correlation table is diffi- 
cult and uncertain, however, more precise statistical techniques have been 
developed to locate the common factors required to account for the obtained 
correlations. These techniques of factor analysis will be examined further in 
Chapter 13, together with multiple aptitude tests developed by means of 
factor analysis. The application of factor analysis to the construction of
personality tests will be illustrated in Chapter 18.
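
The first step described above, correlating each test with every other, can
be sketched in a few lines. This is only an illustration under stated
assumptions: the scores are simulated random numbers, and the names
`pearson_r`, `correlation_table`, and `scores` are hypothetical. It shows
where the figure of 190 comes from: 20 tests yield 20 x 19 / 2 distinct pairs.

```python
import random
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_table(scores):
    """Correlate every test with every other test, once per pair."""
    return {
        (a, b): pearson_r(scores[a], scores[b])
        for a, b in combinations(sorted(scores), 2)
    }


# Simulated data (an assumption for illustration): 20 tests, 300 persons.
random.seed(0)
scores = {
    f"test_{i:02d}": [random.gauss(50, 10) for _ in range(300)]
    for i in range(20)
}

table = correlation_table(scores)
print(len(table))  # number of distinct test pairs: 20 * 19 / 2
```

With real data, one would then look through `table` for groups of tests that
correlate highly with each other and little with the rest, which is the
inspectional analysis the text describes as difficult and uncertain.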

In the process of factor analysis, the number of variables or categories in
terms of which each individual's performance can be described is reduced
from the number of original tests to a relatively small number of factors, or
common traits. In the example cited above, five or six factors might suffice
to account for the intercorrelations among the 20 tests. Each individual
might thus be described in terms of his scores in the five or six factors,
rather than in terms of the original 20 scores. A major purpose of factor
analysis is to simplify the description of behavior by reducing the number of
categories from an initial multiplicity of test variables to a few common
factors, or traits.

After the factors have been identified, they can be utilized in describing
the factorial composition of a test. Each test can thus be characterized in
terms of the major factors determining its scores, together with the weight
or loading of each factor. Such factor loadings also represent the correlations
of the test with each factor, a correlation known as the factorial validity of 
the test. Thus if the verbal comprehension factor has a weight of .66 in a 
vocabulary test, the factorial validity of this vocabulary test as a measure of 
the trait of verbal comprehension is .66. It should be noted that factorial 
validity is essentially the correlation of the test with whatever is common to 
a group of tests or other indices of behavior. The set of variables analyzed 
can, of course, include both test and non-test data. Ratings and other criterion 
measures can thus be utilized, along with other tests, to explore the factorial 
validity of a particular test and to define the common traits it measures. 

Internal Consistency. In the published descriptions of certain tests,
especially in the area of personality, the statement is made that the test
has been validated by the method of internal consistency. The essential
characteristic of this method is that the criterion is none other than the
total score on the test itself. Sometimes an adaptation of the contrasted
group method is used, extreme groups being selected on the basis of the total
test score. The performance of the upper criterion group on each test item is
then compared with that of the lower criterion group. Items that fail to show
a significantly greater proportion of “passes” in the upper than in the lower
criterion group are considered invalid, and are either eliminated or revised.
Correlational procedures may also be employed for this purpose. For example,
the biserial correlation between “pass-fail” on each item and total test
score can be computed. Only those items yielding significant item-test
correlations would be retained. A test whose items were selected by this
method can be said to show internal consistency, since each item
differentiates in the same direction as the entire test.

Another application of the criterion of internal consistency involves the
correlation of subtest scores with total score. Many intelligence tests, for
instance, consist of separately administered subtests (such as vocabulary,
arithmetic, picture completion, etc.) whose scores are combined in finding
the total test score. In the construction of such tests, the scores on each
subtest are often correlated with total score and any subtest whose
correlation with total score is too low is eliminated. The correlations of
the remaining subtests with total score are then reported as evidence of the
internal consistency of the entire instrument.

It is apparent that internal consistency correlations, whether based on
items or subtests, are essentially measures of homogeneity. Because it helps
to characterize the behavior domain or trait sampled by the test, the degree
of homogeneity of a test has some relevance to its construct validity.
Nevertheless, the contribution of internal consistency data to test
validation is very limited. In the absence of data external to the test
itself, little can be learned about what a test measures.
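
The contrasted-group form of the internal consistency method can be sketched
as follows. This is a minimal illustration, not a standard routine: the
response data are invented, the function name `upper_lower_differences` is
hypothetical, and the choice of the top and bottom quarter as extreme groups
is an assumption (the text does not say how large the groups should be). It
reports only the difference in proportions passing and applies no
significance test.

```python
def upper_lower_differences(responses, fraction=0.25):
    """For each item, proportion passing in the upper group minus the lower group.

    responses: one list per person, 1 = pass and 0 = fail on each item.
    fraction:  share of persons placed in each extreme group (assumed value).
    """
    totals = [sum(person) for person in responses]
    # Rank persons by total score, which serves as the criterion here.
    order = sorted(range(len(responses)), key=lambda i: totals[i])
    k = max(1, round(len(responses) * fraction))
    lower, upper = order[:k], order[-k:]
    n_items = len(responses[0])
    differences = []
    for j in range(n_items):
        p_upper = sum(responses[i][j] for i in upper) / k
        p_lower = sum(responses[i][j] for i in lower) / k
        differences.append(p_upper - p_lower)
    return differences


# Invented data: 8 persons, 3 items.
responses = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
]

differences = upper_lower_differences(responses)
print(differences)
```

In this made-up example the third item is passed equally often by the upper
and lower groups, so it would be a candidate for elimination or revision.
Note that nothing outside the test enters the computation, which is the
limitation the text points out.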

Effect of Experimental Variables on Test Scores. A further source of data 
for construct validation is provided by experiments on the effect of selected 
variables on test scores. Whether pitch discrimination as measured by a par- 
ticular test is or is not susceptible to practice, for instance, can be checked 
by administering the test to the same subjects before and after a period of 
intensive practice. A test designed to measure anxiety-proneness can be ad- 
ministered to subjects who are subsequently put through a situation designed 
to arouse anxiety, such as taking an examination under distracting and stress- 
ful conditions. The initial anxiety test scores can then be correlated with 
physiological and other indices of anxiety expression during and after the 
examination. 

Except for the artificial introduction of an anxiety-provoking variable, the 
procedure followed in the above experiment is similar to that used in estab- 
lishing predictive validity. The direct measures of anxiety expression will be 
recognized as the criterion against which the predictions made from test 
scores are validated. A different hypothesis regarding an anxiety test could 
be evaluated by administering the test before and after an anxiety-arousing 
experience and seeing whether test scores rise significantly on the retest. 
Positive findings from such an experiment would indicate that the test scores 
reflect current anxiety level. In a similar way, experiments can be designed
to test any other hypothesis regarding the trait measured by a given test. 


A COMPREHENSIVE VIEW 


We have considered four major ways of asking, "How valid is this test?"
To highlight the distinctive features of these four types of validity, let us
apply each in turn to a test consisting of fifty assorted arithmetic problems.
This test might be used:

• As an achievement test in elementary school arithmetic (content validity):
  How much has Dick learned in the past?

• As an aptitude test to predict performance in high school mathematics
  (predictive validity):
  How well will Jim learn in the future?

• As a technique for diagnosing brain damage (concurrent validity):
  Does Bill belong in the brain-damaged or in the normal group?

• As a measure of logical reasoning (construct validity):
  How can we describe Henry's psychological functioning?

Lest the above example should make validity appear clear and simple, let
us hasten to inject some disturbing thoughts. The four types of validity are 
not distinct and logically coordinate. Construct validity is a comprehensive 
concept, which includes the other three types. All the specific techniques for 
establishing content, predictive, and concurrent validity, discussed in earlier 
sections of this chapter, could have been listed again under construct validity. 
Comparing the test performance of contrasted groups, such as neurotics and 
normals, is one way of checking the construct validity of a test designed to 
measure emotional adjustment, anxiety, or other postulated traits. Compar- 
ing the test scores of institutionalized mental defectives with those of normal 
school children is one way to investigate the construct validity of an intelli- 
gence test. The correlations of a mechanical aptitude test with performance 
in shop courses and in a wide variety of jobs contribute to our understanding 
of the construct measured by the test. 

Content validity likewise enters into both the construction and the subse- 
quent evaluation of all tests. In assembling items for any new test, the test 
constructor is guided by hypotheses regarding the relations between the type 
of content he chooses and the behavior he wishes to measure. Empirical and 
concurrent validation, as well as the other techniques discussed under
construct validation, represent ways of testing such hypotheses. As for the
test user, he too relies in part on content validity in evaluating any test.
For example, he may check the vocabulary in an emotional adjustment inventory
to determine whether some of the words are too difficult for the subjects he
plans to test; he may choose a non-verbal rather than a verbal test of
intelligence for examining children with reading disabilities; he may
conclude that the scores on a particular test depend too much on speed for
his purposes; or he may notice that an intelligence test developed twenty
years ago contains many obsolescent items unsuitable for use today. All these
observations about content are relevant to the construct validity of a test.
In fact, there is no information provided by any validation procedure that is
not relevant to construct validity.

Following its introduction in the Technical Recommendations in 1954, the
concept of construct validity was subjected to lively discussion. In a number
of provocative articles, construct validity was extensively elaborated (8),
favorably reviewed (6), diligently illustrated (12), vigorously attacked (3),
partially redefined (15), and incisively sharpened (5). The basic idea of
construct validity is not new. The use of theoretical constructs or trait
categories is as old as psychological testing. Some of the earliest tests
were designed to measure such constructs as attention, memory, and
association. Nor should we forget that most notorious of all theoretical
constructs, "intelligence." Similarly, none of the validation techniques
specially identified with construct validity is new. Test manuals had been
reporting data on age differentiation, correlations with other tests,
factorial validity, internal consistency, and the effect of such experimental
variables as practice on test scores long before construct validity was given
a name and official respectability in the Technical
Recommendations. 

What, then, has the concept of construct validity contributed to psycho- 
logical testing? First, it has focused attention on the desirability of basing 
test construction on an explicitly recognized theoretical foundation. Both in 
devising a new test and in setting up procedures for its validation, the investi- 
gator is urged to formulate psychological hypotheses. The proponents of con- 
struct validity have tried to integrate psychological testing more closely with 
psychological theory and experimental methods. A second contribution has 
been to stimulate the search for novel ways of gathering validation data. Al- 
though the major techniques currently employed to estimate construct validity 
have been familiar for many years, more exploration of different validation 
techniques can be expected. 

A possible danger in the application of construct validity is that it may 
open the way for subjective, unverified assertions about test validity. Since
construct validity is such a broad and loosely defined concept, it has been
widely misunderstood. Some textbook writers and test constructors seem to
perceive it as content validity expressed in terms of psychological trait
names. Hence they present as construct validity purely subjective accounts of
what they believe (or hope) their test measures. It is also unfortunate that
the chief exponents of construct validity have asserted that this type of
validation “is involved whenever a test is to be interpreted as a measure of
some attribute or quality which is not ‘operationally defined’ ” (8, p. 282).
Such a statement opens the door wider for fuzzy thinking about test scores
and the
traits they measure. 

Actually, the theoretical construct, trait, or behavior domain measured by
any test can be defined in terms of the operations performed in establishing
the validity of the test. Such a definition would take into account the
various criteria with which the test correlated significantly, as well as the
conditions found to affect its scores and the groups differing significantly
in such scores. These procedures are entirely in accord with the positive
contributions made by the concept of construct validity. It would also seem
desirable to retain the concept of the criterion in construct validation, not
as a specific practical measure to be predicted, but more generally to refer
to independently gathered external data. The need to base all validation on
data, rather than on armchair speculation, would thus be re-emphasized, as
would the need for data external to the test scores themselves. Internal
analysis of the test, through item-test correlations, factorial analyses of
test items, etc., is never an adequate substitute for external validation.
CHAPTER 7


Utilization of Validity Data 


The test user is concerned with validity at either or both of two stages.
First, when considering the suitability of a test for his purposes, he
examines available validity data reported in the test manual or other
published sources. Through such information, he arrives at a tentative
concept of what psychological functions the test actually measures and he
judges the relevance of such functions to his proposed use of the test. In
effect, when a test user relies on published validation data, he is dealing
with construct validity, regardless of the specific procedures whereby the
data were gathered. Even when predictive or concurrent validity is reported,
the criteria employed cannot be assumed to be identical with those the test
user wants to predict or diagnose. Jobs bearing the same title in two
different companies are rarely, if ever, identical. Two courses in freshman
English taught in different colleges may be quite dissimilar. The pronounced
variation in apparently similar criteria is borne out by the validity
coefficients reported in many test manuals. Thus when scores on a single test
are correlated with grades in a particular course, the resulting validity
coefficients usually vary widely from one college to another. At least part
of this variation results from differences in the specific criterion,
although size and heterogeneity of the group tested also affect the
correlations.

Because of the specificity of each criterion, test users should check the 
validity of any chosen test against local criteria whenever possible. Although 
published data may strongly suggest that a given test should have high valid- 
ity in a particular situation, direct corroboration is always desirable. The 
determination of validity against specific local criteria represents the second 
stage in the test user's evaluation of validity. The techniques to be discussed 
in this chapter are especially relevant to the analysis of validity data
obtained by the test user himself. Most of them are also useful, however, in
understanding and evaluating the validity data reported in test manuals. 
EXPECTANCY TABLES


It will be recalled that in both predictive and concurrent validity, test 
scores are evaluated against an independent criterion which the test is
designed to predict or diagnose, respectively. The relation between subjects'
test scores and their criterion status can be analyzed in a number of ways. A 
simple device for expressing this relation is provided by the expectancy table 
(1, 2, 35). Such a table shows the likelihood of different criterion outcomes 
for persons obtaining each test score. 

Figures 22 and 23 represent expectancy tables for a dichotomous, or twofold,
criterion. Figure 22 shows the relation between scores on the Army General
Classification Test (AGCT) and successful completion of officer training
course. The chart gives the percentage of men within each AGCT score interval
who actually received a commission in the sample investigated. It will be
noted that this percentage increases consistently from the group scoring
under 110 to that scoring 140 and over. Figure 23 shows a similar
correspondence between percentage of men eliminated in primary flight
training and stanine score obtained on a pilot selection battery developed by
the Air Force. While 77 per cent of the men receiving a stanine of 1 were
eliminated, only 4 per cent of those at stanine 9 failed to complete the
course satisfactorily, the percentage decreasing consistently over the
intervening stanines.

Fig. 22. Validity of Army General Classification Test in Predicting Success
in Officer Training Course. (From Boring, 3, p. 242.)

Fig. 23. Validity of a Pilot Selection Battery in Predicting Success in
Primary Flight Training. (From Flanagan, 13, p. 58.)



The percentages given in Figures 22 and 23 provide an estimate of the
probabilities that individuals tested in the future will attain the specified
criterion status. On the basis of Figure 23, for example, it is predicted that 
approximately 40 per cent of pilot cadets who obtain a stanine score of 4 
will fail, and that approximately 60 per cent will satisfactorily complete 
primary flight training. Similar statements regarding the expectancy of success 
or failure can be made about individuals who receive each of the stanines. 
Thus an individual with a stanine of 4 has a 60:40 or 3:2 chance of com- 
pleting primary flight training. 
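
The step from an expectancy percentage to a statement of chances is simple
arithmetic, shown here as a small sketch. The function name
`odds_of_success` is hypothetical; the 60 per cent figure is the stanine 4
value quoted from Figure 23 in the text.

```python
from fractions import Fraction


def odds_of_success(pct_success):
    """Reduce success:failure odds to lowest terms.

    pct_success: whole-number percentage of persons who succeed (1 to 99).
    """
    odds = Fraction(pct_success, 100 - pct_success)
    return odds.numerator, odds.denominator


print(odds_of_success(60))  # 60 per cent succeed, 40 per cent fail
```

The sketch assumes a percentage strictly between 0 and 100; a value of 100
would leave no failures to form the second term of the ratio.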

When the criterion is a continuous rather than a dichotomous variable, the
expectancy table can be constructed directly from a scatter diagram. In
plotting such a scatter diagram, each individual's standing in both test and
criterion is tallied simultaneously, as illustrated in Figure 24. Along the
baseline of this diagram are shown the criterion measures, grades in rhetoric, and
along the side are given scores on the Sentences Test of the Differential 
Aptitude Tests (DAT). In this test, the subject is required to locate errors in 
grammar, punctuation, or word usage in a series of sentences. Each tally 
mark in Figure 24 shows the test score and rhetoric grade of each of 100 
freshman women tested in a state teachers’ college. The total frequencies for 
the different cells, as well as the row totals, have been indicated. 
Fig. 24. Bivariate Distribution Showing Relationship between Scores on DAT
Sentences Test and Grades in Rhetoric. (From Wesman, 36, p. 2.)

To convert this scatter diagram into an expectancy table, it is only
necessary to express each cell frequency as a percentage of the corresponding
row total. For example, of the 16 subjects who scored between 30 and 39 on
the test, 6 per cent (1 case) received a grade of F in the rhetoric course,
19 per cent (3 cases) received D, 56 per cent (9 cases) C, and 19 per
cent (3 cases) B. The complete expectancy table will be found in Table 9.
Given an individual's test score, it is possible by means of such an
expectancy table to predict the chances of his obtaining each grade in the
criterion variable. For example, we would expect anyone who scores above 80
to receive a grade of A, and anyone scoring above 70 to receive either A or
B. Similarly, we would expect no D or F grades among those scoring above
50 on the test. Other predictions of this sort can readily be made by
reference to Table 9.
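
The conversion of one scatter-diagram row into expectancy percentages can be
sketched as follows, using the cell frequencies the text gives for the 30-39
score interval (1, 3, 9, and 3 cases out of 16). The function name
`expectancy_row` and the dictionary layout are illustrative assumptions.

```python
def expectancy_row(cell_counts):
    """Express each cell frequency as a whole-number percentage of the row total."""
    total = sum(cell_counts.values())
    return {grade: round(100 * n / total) for grade, n in cell_counts.items()}


# Cell frequencies for the 30-39 score interval, as quoted in the text.
row_30_39 = {"F": 1, "D": 3, "C": 9, "B": 3, "A": 0}

percentages = expectancy_row(row_30_39)
print(percentages)  # {'F': 6, 'D': 19, 'C': 56, 'B': 19, 'A': 0}
```

Applying the same function to every row of the scatter diagram would yield
the full expectancy table. Because each percentage here rests on at most 16
cases, a shift of a single case moves a cell by about 6 percentage points.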
TABLE 9. Expectancy Table Showing the Relation between Scores on the DAT
Sentences Test and Rhetoric Grades

(From Wesman, 35, p. 2.)

Test Percentage Receiving Each Grade
Scores F D C B A
80-89 100
70-79 20 80
60-69 14 63 23
50-59 39 35 26
40-49 14 59 27
30-39 6 19 56 19
20-29 13 50 37
10-19 100
0- 9 100


It should be noted, of course, that such predictions are subject to
considerable sampling error, especially when the number of cases tested is
small. Since the individual percentages are based on the relatively few
persons falling within a single cell, the chance fluctuations in such
percentages from sample to sample will undoubtedly be large. On the other
hand, presentation of validity data by means of expectancy tables is vivid
and clear, and it permits an examination of the predictive value of the test
in different parts of the range.


VALIDITY COEFFICIENTS 


To experienced test users, the most familiar method for reporting test 
validity is through a validity coefficient, showing the correlation between test 
and criterion. Not only does such a correlation provide a single, over-all 
index of the validity of the test, but it is also more stable and less subject
to sampling fluctuation than the expectancy table percentages, since it is
based on all the cases in the group. A correlation coefficient can be found
for any set of data that can be represented by an expectancy table. In the
case of two continuous variables, as illustrated in Figure 24 and Table 9, the
familiar Pearson Product-Moment Correlation Coefficient can be computed.
In the example cited, the correlation between DAT Sentences Test and
rhetoric grades is .71.
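
A validity coefficient of this kind is an ordinary product-moment
correlation between test scores and criterion measures. The sketch below
shows the computation on a handful of invented cases; the scores, the
numerical coding of grades (F = 0 through A = 4), and the names used are
assumptions for illustration and do not reproduce the DAT data, so the result
will not equal .71.

```python
from math import sqrt


def validity_coefficient(test_scores, criterion):
    """Pearson product-moment correlation between test and criterion."""
    n = len(test_scores)
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(test_scores, criterion))
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    syy = sum((y - mean_y) ** 2 for y in criterion)
    return sxy / sqrt(sxx * syy)


# Invented data: eight students' test scores and course grades coded 0-4.
test_scores = [35, 42, 48, 55, 61, 67, 74, 82]
grades = [1, 2, 2, 2, 3, 3, 4, 4]

r_xy = validity_coefficient(test_scores, grades)
print(round(r_xy, 2))
```

Unlike a single cell of an expectancy table, this index uses every case in
the group, which is why the text describes it as less subject to sampling
fluctuation.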


When the criterion is dichotomized, as in the situations represented by
Figures 22 and 23, a biserial correlation may be computed. Such a correlation
is based upon the difference in mean test scores obtained by the two
criterion groups, as well as upon the standard deviation of test scores and
the proportion of individuals who fall into the two criterion groups. Other
types of correlation coefficients are available for expressing the
relationship between test scores and criterion under still different
circumstances, as when both variables are dichotomized (tetrachoric
correlation), or when the relationship varies at different parts of the range
(curvilinear correlation). The specific procedures for computing these
different kinds of correlations can be found in any standard statistics text,
such as those cited in Chapter 4.
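
The ingredients of the biserial correlation named above (difference between
the group means, standard deviation of all test scores, and the proportions
in the two groups) combine in the usual textbook formula
r_bis = ((M_p - M_q) / s) * (p * q / y), where y is the height of the unit
normal curve at the point dividing the proportions p and q. The sketch below
is one reading of that formula, offered under stated assumptions: the data
are invented, the function name `biserial_r` is hypothetical, and the
population form of the standard deviation is used.

```python
from statistics import NormalDist, mean, pstdev


def biserial_r(scores, succeeded):
    """Biserial correlation between test scores and a dichotomized criterion.

    scores:    test score for each person.
    succeeded: True/False criterion status for each person (both must occur).
    """
    upper = [s for s, ok in zip(scores, succeeded) if ok]
    lower = [s for s, ok in zip(scores, succeeded) if not ok]
    p = len(upper) / len(scores)  # proportion in the successful group
    q = 1.0 - p                   # proportion in the unsuccessful group
    unit_normal = NormalDist()
    # Ordinate of the normal curve at the point separating p from q.
    y = unit_normal.pdf(unit_normal.inv_cdf(p))
    return (mean(upper) - mean(lower)) / pstdev(scores) * (p * q / y)


# Invented data: four successes followed by four failures.
scores = [5, 6, 7, 8, 3, 4, 5, 6]
succeeded = [True, True, True, True, False, False, False, False]

r_bis = biserial_r(scores, succeeded)
print(round(r_bis, 2))
```

The formula assumes that the dichotomized criterion reflects an underlying
normally distributed variable; when that assumption is badly violated the
coefficient can exceed 1.00, which is one reason to consult a statistics text
before relying on it.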

As in test reliability, discussed in Chapter 5, it is essential to specify the 
nature of the group on which a validity coefficient is determined. The same
test may measure different functions when given to individuals who differ in
age, sex, educational level, occupation, or any other relevant characteristic.
It has already been pointed out in Chapter 6, in connection with content
validity, that the work methods employed to arrive at the same solution of a
test problem may vary widely from group to group. Consequently, a test 
could have high validity in predicting a particular criterion in one popula- 
tion, and little or no validity in another. Or it might be a valid measure of 
different functions in the two populations. Thus, unless the validation sample 
is representative of the population on which the test is to be used, validity 
should be redetermined on a more appropriate sample. 

The question of range of scores is, of course, as relevant to the
measurement of validity as it is to the measurement of reliability, since
both characteristics are commonly reported in terms of correlation
coefficients. It will be recalled that, other things being equal, the wider
the range of scores, the higher will be the correlation. This fact should be
kept in mind when interpreting the validity coefficients given in test
manuals.

A special difficulty encountered in many validation samples arises from
preselection. For example, a new test that is being validated for job
selection may be administered to a group of newly hired employees on whom
criterion measures of job performance will eventually be available. It is
very likely, however, that such employees represent a superior selection of
all those who applied for the job. Hence the range of [...] criterion
measures will be curtailed at the [...] for selection purposes, the validity
can [...]. If adequate information is available regarding [...] has occurred
in the validation sample, correction formulas can be applied to [...] 9 ff).

[...] answer to this question [...] coefficient must take into account a
number of concomitant circumstances. The obtained correlation, of course,
should be high enough to be significant at some acceptable
level, such as the .01 or .05 levels discussed in Chapter 5. In other words, 
before drawing any conclusions about the validity of a test, we should be 
reasonably certain that the obtained validity coefficient could not have arisen 
through chance fluctuations of sampling from a true correlation of zero. 

Having established a significant correlation between test scores and cri- 
terion, however, we need to evaluate the size of the correlation in the light 
of the uses to be made of the test. If we wish to predict an individual's 
exact criterion score, such as the grade-point average a student will receive 
in college, the validity coefficient may be interpreted in terms of the standard 
error of estimate, which is analogous to the error of measurement discussed 
in connection with reliability. It will be recalled that the error of measurement 
indicates the margin of error to be expected in an individual's score as a re- 
sult of the unreliability of the test. Similarly, the error of estimate shows the 
margin of error to be expected in the individual's predicted criterion score, as 
a result of the imperfect validity of the test.

The error of estimate is found by the following formula:

$$\sigma_{est.} = \sigma_y \sqrt{1 - r_{xy}^2}$$

in which $r_{xy}^2$ is the square of the validity coefficient and $\sigma_y$
is the standard deviation of the criterion scores. It will be noted that if
the validity were perfect ($r_{xy} = 1.00$), the error of estimate would be
zero. On the other hand, with a test having zero validity, the error of
estimate is as large as the standard deviation of the criterion distribution
($\sigma_{est.} = \sigma_y \sqrt{1 - 0} = \sigma_y$). Under these conditions,
the prediction is no better than a guess, and the range of prediction error
is as wide as the entire distribution of criterion scores. Between these two
extremes are to be found the errors of estimate corresponding to tests of
varying validity.

Reference to the formula for $\sigma_{est.}$ will show that the term
$\sqrt{1 - r_{xy}^2}$ serves to indicate the size of the error relative to
the error that would result from a mere guess, i.e., with zero validity. In
other words, if $\sqrt{1 - r_{xy}^2}$ is equal to 1.00, the error of estimate
is as large as it would be if we were to guess the subject's score. The
predictive improvement attributable to the use of the test would thus be nil.
If the validity coefficient is .80, the $\sqrt{1 - r_{xy}^2}$ is equal to
.60, and the error is 60 per cent as large as it would be by chance. To put
it differently, the use of such a test enables us to predict the individual's
criterion performance with a margin of error that is 40 per cent smaller than
it would be if we were to guess.

It would thus appear that even with a validity of .80, which is unusually
high, the error of predicted scores is considerable. If the primary function
of psychological tests were to predict each individual's exact position in
the criterion distribution, the outlook would be quite pessimistic. When
examined in the light of the error of estimate, most tests do not appear very
efficient.
In most testing situations, however, it is not necessary to predict the specific 
criterion performance of individual cases, but rather to determine which in- 
dividuals will exceed a certain minimum standard of performance, or cutoff 
point, in the criterion. Who will successfully complete medical school, primary 
flight training, or officer candidate school? Who will prove to be a satisfactory 
clerk, salesman, or machine operator? Such information is useful not only 
in group selection but also in individual guidance. For example, it is ad- 
vantageous to be able to predict that a given person has a good chance of 
passing all courses in law school, even if we are unable to estimate with 
certainty whether his grade average will be 74 or 81. 

For some purposes, a test may appreciably improve predictive efficiency 
if it shows any significant correlation with the criterion, however low. Under 
certain circumstances, even validities as low as .20 or .30 may justify in- 
clusion of the test in a selection program. It is gradually being recognized 
that the traditional evaluation of tests in terms of the error of estimate is 
unrealistically stringent. Increasing attention is being given to other ways of 
evaluating the contribution of a test, which take into account the types of 
decisions to be made from the scores. Some of these procedures will be 
illustrated in the following section. 


TEST VALIDITY AND DECISION THEORY 


Let us suppose that 100 applicants have been given an aptitude test and 
followed up until each could be evaluated for success on a certain job. 
Figure 25 shows the bivariate distribution of test scores and measures of 
job success for the 100 subjects. The correlation between these two variables 
is slightly below .70. The minimum acceptable job performance, or criterion 
cutoff point, is indicated in the diagram by a heavy horizontal line. The 40 
cases falling below this line would represent job failures; the 60 above the 
line, job successes. If all 100 applicants are hired, therefore, 60 per cent 
will succeed on the job. Similarly, if a smaller number were hired at random, 
without reference to test scores, the proportion of successes would probably 
be close to 60 per cent. Suppose, however, that the test scores are used to 
select the 45 most promising applicants out of the 100 (selection ratio 
= .45). In such a case, the 45 individuals falling to the right of the heavy 
vertical line would be chosen. Within this group of 45, it can be seen that 
there are 7 job failures, or “misses,” falling below the heavy horizontal line, 
and 38 job successes. Hence the percentage of job successes is now 84 
rather than 60 (i.e., 38/45 = .84). This increase is attributable to the use 
of the test as a screening instrument. It will be noted that errors in predicted 
criterion score that do not affect the decision can be ignored. Only those 
prediction errors that cross the cutoff line and hence place the individual in 
the wrong category will reduce the selective effectiveness of the test. 

Fig. 25. Increase in the Proportion of “Successes” Resulting from the Use of a 
Screening Test. 


A useful concept in the evaluation of screening effectiveness is that of 
“false positives.” This term has been adopted from the medical field, in 
which a test for a pathological condition is reported to be positive if the 
condition is present. A false positive thus refers to a case in which the test 
erroneously indicates the presence of the pathological condition. In the 
screening situation, the false positives are the successful workers who are 
rejected on the basis of the test. Referring again to Figure 25, we find a 
total of 22 false positives within the upper left-hand quadrant of the graph. 
These individuals fall above the criterion cutoff, but below (to the left of) 
the cutoff score on the test. On the other hand, the 33 “hits” in the lower 
left-hand quadrant of Figure 25 are job failures who were correctly identified 
as such by the test. 
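
The four quadrant counts quoted for Figure 25 are enough to recover every proportion used in this example. The short Python sketch below does the bookkeeping; the variable names are illustrative assumptions, and only the counts come from the text.

```python
# Quadrant counts as reported in the text for Figure 25 (100 applicants).
selected_successes = 38   # above both cutoffs: hired and successful
selected_failures = 7     # hired but below the criterion cutoff ("misses")
rejected_successes = 22   # rejected, yet above the criterion cutoff
rejected_failures = 33    # rejected and below the criterion cutoff ("hits")

total = (selected_successes + selected_failures
         + rejected_successes + rejected_failures)
n_selected = selected_successes + selected_failures

base_rate = (selected_successes + rejected_successes) / total
selection_ratio = n_selected / total
success_rate_selected = selected_successes / n_selected
correct_decisions = (selected_successes + rejected_failures) / total

print(f"base rate of success:        {base_rate:.2f}")
print(f"selection ratio:             {selection_ratio:.2f}")
print(f"successes among selected:    {success_rate_selected:.2f}")
print(f"proportion of correct calls: {correct_decisions:.2f}")
```

The last figure, the proportion of all 100 decisions that turned out to be correct, is not stated in the text but follows directly from the same counts.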

In setting a cutoff score on a test, attention should be given to the per- 
centage of false positives as well as to the percentages of successes and 
failures within the selected group. In certain situations, the cutoff point 
should be set sufficiently high to exclude all but a few possible failures. This 
would be the case, for example, when the job is of such a nature that a 
poorly qualified worker could cause serious loss or damage. Under other 
circumstances, it may be more important to admit as many qualified persons 
as possible, at the risk of including more failures. In the latter case, the 
number of false positives can be reduced by the choice of a lower cutoff score. 
Other factors that normally determine the position of the cutoff score 
include the available personnel supply, the number of job openings, and the 
urgency or speed with which the openings must be filled. 

In the terminology of decision theory, the example given in Figure 25 
illustrates a simple strategy, or plan for deciding which applicants to accept 
and which to reject. In this case, the strategy was to accept the 45 persons 
with the highest test scores. The increase in percentage of successful 
employees from 60 to 84 could be used as a basis for estimating the payoff, or 
net benefit to the company resulting from the use of the test. 

Statistical decision theory was developed by Wald (33) in the 1940's with 
special reference to the decisions required in the inspection and quality 
control of industrial products. Many of its implications for the construction 
and interpretation of psychological tests have been systematically worked out 
by Cronbach and Gleser (9). Essentially, decision theory is an attempt to put 
the decision-making process into mathematical form, so that available 
information may be used to arrive at the most effective decision under 
specified circumstances. The mathematical procedures employed in decision 
theory are often quite complex, and few are in a form permitting their 
immediate application to practical testing problems. Some of the basic 
concepts of decision theory, however, are proving helpful in the reformulation 
and clarification of certain questions about tests. A few of these ideas were 
introduced into testing before the formal development of statistical decision 
theory and were later recognized as fitting into that framework (cf. 9). 

An example of such a precursor of decision theory in psychological testing 
is to be found in the Taylor-Russell Tables (28). These tables permit a 
determination of the net gain in selection accuracy attributable to the use of 
the test. The information required includes the validity coefficient of the 
test, the selection ratio or proportion of applicants who must be accepted, and 
the proportion of successful applicants selected without the use of the test. 
A change in any of these three factors can alter the predictive efficiency of 
the test. 

For purposes of illustration, one of the Taylor-Russell Tables has been 
reproduced in Table 10. This table is designed for use when the percentage 
of successful applicants selected prior to the use of the test is 60. Across the 
top are given different values of the selection ratio, and along the side are 
the test validities. The entries in the body of the table indicate the proportion 
of successful persons selected after the use of the test. Thus the difference 
between .60 and any one table entry shows the increase in proportion of 
successful selections attributable to the test. 


TABLE 10. Proportion of “Successes” Expected through the Use of Tests of Given 
Validity, When Proportion of “Successes” Prior to Use of Test Was .60 

(From Taylor and Russell, 28, p. 576) 

                                 Selection Ratio 

Validity    .05   .10   .20   .30   .40   .50   .60   .70   .80   .90   .95 
   .00      .60   .60   .60   .60   .60   .60   .60   .60   .60   .60   .60 
   .60      .96   .94   .90   .87   .83   .80   .76   .73   .69   .65   .63 
   .65      .98   .96   .92   .89   .85   .82   .78   .74   .70   .65   .63 
  1.00     1.00  1.00  1.00  1.00  1.00  1.00  1.00   .86   .75   .67   .63 

Obviously if the selection ratio were 100 per cent, that is, if all applicants 
had to be accepted, no test, however valid, could improve the selection 
process. Reference to Table 10 shows that, when as many as 95 per cent of 
applicants must be admitted, even a test with perfect validity (r = 1.00) 
would raise the proportion of successful persons by only 3 per cent (.60 to 
.63). On the other hand, when only 5 per cent of applicants need to be 
chosen, a test with a validity coefficient of only .30 can raise the percentage 
of successful applicants selected from 60 to 82. The rise from 60 to 82 rep- 
resents the net effectiveness of the test. It indicates the contribution the test 
makes to the selection of individuals who will meet the minimum standards 
in criterion performance. In applying the Taylor-Russell Tables, of course, 
test validity should be computed on the same sort of group used to estimate 
percentage of prior successes. In other words, the contribution of the test is 
not evaluated against chance success unless applicants were previously se- 
lected by chance—a most unlikely circumstance. Since applicants are ordi- 
narily selected on the basis of previous job history, letters of recommendation, 
interviews, and the like, the contribution of the test should be evaluated on 
the basis of what the test adds to these previous selection procedures. 
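
Entries of the kind shown in the Taylor-Russell Tables can be approximated numerically. The Python sketch below assumes that test and criterion scores follow a bivariate normal distribution and that the highest-scoring applicants are selected; the function name, the midpoint-rule integration, and the step count are illustrative choices, not the procedure Taylor and Russell themselves used.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_success_rate(validity, selection_ratio, base_rate, steps=4000):
    """Approximate the proportion of successes among those selected.

    Assumes standard bivariate normal test and criterion scores with
    correlation `validity` (|validity| < 1), top-down selection on the test,
    and a criterion cutoff that `base_rate` of all applicants exceed.
    """
    if not -1.0 < validity < 1.0:
        raise ValueError("this sketch needs |validity| < 1")
    test_cutoff = STD_NORMAL.inv_cdf(1.0 - selection_ratio)
    criterion_cutoff = STD_NORMAL.inv_cdf(1.0 - base_rate)
    residual_sd = math.sqrt(1.0 - validity ** 2)

    upper = test_cutoff + 10.0  # normal density beyond this point is negligible
    width = (upper - test_cutoff) / steps
    joint = 0.0  # accumulates P(selected and successful)
    for i in range(steps):
        x = test_cutoff + (i + 0.5) * width  # midpoint of the i-th slice
        z = (validity * x - criterion_cutoff) / residual_sd
        joint += STD_NORMAL.pdf(x) * STD_NORMAL.cdf(z) * width
    return joint / selection_ratio


for r, sr in ((0.30, 0.05), (0.60, 0.05), (0.60, 0.50)):
    rate = expected_success_rate(r, sr, base_rate=0.60)
    print(f"validity {r:.2f}, selection ratio {sr:.2f}: {rate:.2f}")
```

Because the result depends on the normality assumption and on numerical integration, it should agree with the published tables only to about two decimal places.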

In many practical situations, what is wanted is an estimate of the effect 
of the selection test, not on percentage of workers exceeding the minimum 
performance, but on over-all output of the selected workers. How does the 
actual level of job proficiency or criterion achievement of the workers hired 
on the basis of the test compare with that of the total applicant sample that 
would have been hired without the test? Following the work of Taylor and 
Russell, several investigators addressed themselves to this question (4, 7, 
19, 26). Brogden (4) first demonstrated that the expected increase in out- 
put is directly proportional to the validity of the test. Thus the improvement 
resulting from the use of a test of validity .50 is 50 per cent as great as the 
improvement expected from a test of perfect validity. 

The relation between test validity and expected rise in criterion 
achievement can be readily seen in Table 11. Expressing criterion scores as 
standard scores with a mean of zero and an SD of 1.00, this table gives the 
expected mean criterion score of workers selected with a test of given validity 
and with a given selection ratio. It will be noted, for example, that when 20 
per cent of the applicants are hired and the validity coefficient is .50, the 
mean criterion performance is .70 SD above the mean of the unselected 
population of applicants. With the same selection ratio and a perfect test 
(validity coefficient = 1.00), the mean criterion score of the selected 
applicants would be 1.40, just twice what it would be with the test of validity 
.50. Similar direct linear relations will be found if other mean criterion 
performances are compared within any row of Table 11. For instance, with a 
selection ratio of 60 per cent, a validity of .25 yields a mean criterion score 
of .16, while a validity of .50 yields a mean of .32. Again, doubling the 
validity doubles the output rise. 
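
The values quoted from Table 11 can be reproduced under one added assumption: that test scores are normally distributed and the criterion has a linear regression on them. The mean standard score of the top fraction of a normal distribution is then the normal ordinate at the cutoff divided by the selection ratio, and the expected criterion mean is that value multiplied by the validity. The Python sketch below uses this relation; the function name is illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_mean_criterion(validity: float, selection_ratio: float) -> float:
    """Expected mean criterion standard score of the selected group.

    Assumes normally distributed test scores, linear regression of criterion
    on test, and selection of the top `selection_ratio` of applicants.
    """
    cutoff = STD_NORMAL.inv_cdf(1.0 - selection_ratio)
    # Mean test standard score of those above the cutoff (truncated normal).
    mean_test_score_selected = STD_NORMAL.pdf(cutoff) / selection_ratio
    return validity * mean_test_score_selected


for r, sr in ((0.50, 0.20), (1.00, 0.20), (0.25, 0.60), (0.50, 0.60)):
    mean_score = expected_mean_criterion(r, sr)
    print(f"validity {r:.2f}, selection ratio {sr:.2f}: mean = {mean_score:.2f}")
```

Since validity enters only as a multiplier, doubling it doubles the expected mean at any fixed selection ratio, which is the proportionality described above.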

Such an evaluation of test validity is obviously much more favorable than 
that based on the previously discussed error of estimate. The reason for the 
difference can be seen in the fact that prediction errors that do not affect 
decisions are irrelevant to the selection situation (36). For example, if 
Smith and Jones are both superior workers and are both hired on the basis 
of the test, it does not matter if the test shows Smith to be better than 
Jones while in job performance Jones excels Smith. 

In decision theory, the mathematical payoff function derived for any 
given strategy incorporates a number of parameters not traditionally con- 
sidered in evaluating the predictive effectiveness of tests. The selection ratio 
discussed above is one such parameter. Another is the cost of administering 
the test. Thus a test of low validity is more likely to be retained if it is short, 
inexpensive, easily administered by relatively untrained personnel, and suita- 
ble for group administration. An individual test requiring a trained examiner 
or expensive equipment would need a higher validity to justify its use. A 
further consideration is whether the test assesses an area of criterion-related 
behavior not otherwise covered by available techniques. 

Another major aspect of decision theory is the evaluation of outcomes. The 
payoff, or expected benefit of a decision strategy, is based on the probability 
of each outcome (such as job success or failure), together with estimates of 
the relative value of such outcomes. The lack of adequate systems for assign- 
ing values to outcomes is one of the chief obstacles to the wide application of 
decision theory. In industrial decisions, a dollar and cents value can fre- 
quently be employed. Even in such cases, however, certain outcomes per- 
taining to good will, public relations, and employee morale are difficult to 
assess in monetary terms. Educational decisions must take into account in- 
stitutional goals, social values, and other relatively intangible factors. In- 
dividual decisions, as in counseling, must consider the individual's prefer- 
ences and value system. It has been repeatedly pointed out, however, that 
decision theory did not introduce the problem of values into the decision 
process, but merely made it explicit. Value systems have always entered 
into decisions, but they were not heretofore clearly recognized or systemati- 
cally handled. 
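
A payoff computation of the simplest kind can be sketched in Python as follows. The dollar values and the cost per test are hypothetical figures invented purely for illustration; only the counts of successes and failures are taken from the Figure 25 example.

```python
def expected_payoff(n_successes, n_failures, value_success, value_failure,
                    n_tested=0, cost_per_test=0.0):
    """Net benefit: outcome values weighted by frequency, minus testing cost."""
    return (n_successes * value_success
            + n_failures * value_failure
            - n_tested * cost_per_test)


# Hypothetical values, not drawn from any study.
VALUE_SUCCESS = 1000.0   # assumed gain from one successful employee
VALUE_FAILURE = -500.0   # assumed loss from one job failure
COST_PER_TEST = 10.0     # assumed cost of testing one applicant

# Strategy A: test all 100 applicants, hire the 45 top scorers (38 succeed, 7 fail).
with_test = expected_payoff(38, 7, VALUE_SUCCESS, VALUE_FAILURE,
                            n_tested=100, cost_per_test=COST_PER_TEST)

# Strategy B: hire 45 at random; at a 60 per cent base rate, 27 succeed and 18 fail.
without_test = expected_payoff(27, 18, VALUE_SUCCESS, VALUE_FAILURE)

print(f"payoff with test:    {with_test:,.0f}")
print(f"payoff without test: {without_test:,.0f}")
print(f"gain from testing:   {with_test - without_test:,.0f}")
```

Changing the assumed values changes the conclusion, which is the point made above: the arithmetic is trivial, and the difficulty lies in assigning defensible values to the outcomes.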

Whether a psychological test is to be used in making terminal or sequential 
decisions also influences its effectiveness. For instance, instead of sorting ap- 
plicants into accepted and rejected categories only, we might introduce a 
third category for uncertain cases who are to be examined further with more 
intensive techniques. Another strategy, suitable for the diagnosis of 
psychological disorders, would be to use only two categories but to test 
further all cases classified as positives by the preliminary screening test. It 
should also be noted that many personnel decisions are in effect sequential, 
although 
they may not be so perceived. Incompetent employees hired because of pre- 
diction errors can usually be discharged after a probationary period; failing 
students can be dropped from college at several stages. In such situations, it 
is only adverse selection decisions that are terminal. To be sure, incorrect 
selection decisions that are later rectified may be costly in terms of several 
value systems. But they are often less costly than terminal wrong decisions. 

Still another condition that may alter the effectiveness of a psychological 
test is the availability of alternative treatments and the possibility of adapt- 
ing treatments to individual characteristics (8). An example would be the 
utilization of different training procedures for workers at different aptitude 
levels. Such a strategy will further improve the payoff or evaluated outcome 
of the decisions based on test scores. 

To determine what a test contributes to the decision process, we must also 
consider the base rate, or frequency of a given condition in the population to 
which the test is applied. In a provocative and thorough analysis of this 
problem, Meehl and Rosen (23) demonstrate that in the case of rare conditions 
(whose base rate deviates markedly from 50 per cent), the use of tests of 
low or even moderate validity may actually increase the proportion of 
wrong decisions. Suppose we have a test that can correctly reveal the 
presence or absence of a psychiatric disorder in 85 per cent of the cases 
examined. In other words, the number of abnormals correctly identified as 
abnormal plus the number of normals correctly identified as normal is 85 
out of 100. Let us suppose further that the given disorder occurs in 5 per cent 
of the intake population of the particular clinic where this test is applied. 
Under these conditions, if we merely classified everyone as normal, we 
would be wrong in only 5 per cent of the cases. By applying the test, however, 
we will be wrong 15 per cent of the time. 

In such a situation, increasing the amount of information utilized 
decreases the probability of correct decisions. This paradoxical consequence 
results from the large number of false positives, i.e., normal individuals 
incorrectly diagnosed as abnormal by the test. If everyone is classified as 
normal, on the other hand, there will be no false positives and the number of 
misses will be small because of the low frequency of the condition in the 
population. To identify rare conditions with better than chance success may 
thus require tests of unattainably high validity. It is obviously more fruitful 
to design tests for behavior characteristics whose base rates are closer to 50 
per cent. Another solution is to apply tests within more narrowly defined 
subgroups, which may have less extreme base rates. For example, only 6 per cent 
of the intake population of a clinic may have organic brain pathology. But 
among a subgroup exhibiting certain symptoms, this proportion may be 40 per 
cent. The use of the test within this subgroup would be more effective. 
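
The comparison between using the test and simply assigning everyone to the more frequent category reduces to two subtractions, sketched below in Python. The function names are illustrative, and the sketch assumes that the test's 85 per cent accuracy stays the same when it is applied within the subgroup.

```python
def error_rate_with_test(accuracy: float) -> float:
    """Proportion of wrong decisions when every case is classified by the test."""
    return 1.0 - accuracy


def error_rate_without_test(base_rate: float) -> float:
    """Proportion wrong when everyone is placed in the more frequent category."""
    return min(base_rate, 1.0 - base_rate)


def beats_base_rate(accuracy: float, base_rate: float) -> bool:
    """True if the test yields fewer wrong decisions than base-rate guessing."""
    return error_rate_with_test(accuracy) < error_rate_without_test(base_rate)


# Base rates quoted in the text: 5% and 6% for whole clinics, 40% for the subgroup.
for base_rate in (0.05, 0.06, 0.40):
    print(f"base rate {base_rate:.2f}: "
          f"errors with test = {error_rate_with_test(0.85):.2f}, "
          f"errors without = {error_rate_without_test(base_rate):.2f}, "
          f"test helps = {beats_base_rate(0.85, base_rate)}")
```

On these assumptions the test is worse than base-rate classification for the whole intake population but better within the 40 per cent subgroup.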

When the seriousness of a rare condition makes its diagnosis urgent, tests 
of moderate validity may be employed in an early stage of sequential deci- 
sions. For example, all cases might first be screened with an easily adminis- 
tered test of moderate validity. If the cutoff score is set high enough (high 
scores being favorable), there will be few misses but many false positives. 
The latter can then be detected through a more intensive individual examina- 
tion given to all cases diagnosed as positive by the test. This solution would 
be appropriate, for instance, when available facilities make the intensive in- 
dividual examination of all cases impracticable. 

It might be noted that the base rate we have been considering corresponds 
to one of the factors utilized in the previously discussed Taylor-Russell 
Tables, namely, the proportion of applicants who succeed (or fail) prior to 
the use of the test. Thus if only 5 per cent of present employees fail on a job, 
it is unlikely that the introduction of a new selection test can improve this 
outcome appreciably. Nevertheless, such selection decisions differ in certain 
important respects from the clinical decisions illustrated above. In the em- 
ployment situation, the selection ratio (proportion of applicants to be hired) 
is imposed by external requirements, such as number of job openings, rather 
than being set so as to maximize correct classification. With externally 
determined selection ratios, the use of a test with any significant degree of 
validity will increase the total number of correct decisions (23). A further 
difference is that, in most job-selection decisions, the major emphasis is 
placed on reducing the number of misses (job failures). In clinical situations, 
on the other hand, minimizing the number of false positives may be just as 
important. 

The examples cited provide only glimpses into the ways in which decision 
theory may affect the evaluation of psychological tests. For a more 
direct acquaintance with this field, the reader should consult the relevant 
references listed at the end of this chapter (6, 8, 9, 12, 15, 23). 


ITEM ANALYSIS AND CROSS-VALIDATION 


Both reliability and validity depend ultimately upon the characteristics 
of the items making up the test. Any test can be improved through the 
selection, substitution, or revision of items. Item analysis makes it possible 
to shorten a test, and at the same time increase its validity and reliability. 
Other things being equal, a longer test is more valid and reliable than a 
shorter one. 
The effect of lengthening or shortening a test upon the reliability coefficient 
was discussed in Chapter 5, where the Spearman-Brown formula for esti- 
mating such an effect was also presented. These estimated changes in re- 
liability occur when the discarded items are equivalent to those that re- 
main, or when equivalent new items are added to the test. Similar changes 
in validity will result from the deletion or addition of items of equivalent 
validity. All such estimates of change in reliability or validity refer to the 
lengthening or shortening of tests through a random selection of items, without 
item analysis. When, however, a test is shortened by eliminating the least 
satisfactory items, the short test may be more valid and reliable than the 
original, longer instrument. 
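
For reference, the Spearman-Brown estimate mentioned above can be written as a one-line function. The Python sketch below assumes the standard form of the formula, in which the estimated reliability is n·r / (1 + (n − 1)·r) for a test lengthened n times with equivalent items; the function and parameter names are illustrative.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Estimated reliability after changing test length by `length_factor`.

    `length_factor` is new length divided by old length (2.0 doubles the test,
    0.5 halves it). Assumes added or dropped items are equivalent to the rest.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability
            / (1.0 + (length_factor - 1.0) * reliability))


for factor in (0.5, 1.0, 2.0, 3.0):
    estimate = spearman_brown(0.60, factor)
    print(f"length x{factor:.1f}: estimated reliability = {estimate:.2f}")
```

The estimate applies only to random changes in length. It says nothing about a test shortened by discarding its poorest items, which is the case taken up above.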

The validity of any item can be determined by correlating responses to 
that item with a criterion measure. For example, answers to a personality 
test question can be correlated with each individual's rating for dominance. 
Since item responses are often dichotomous (yes-no, right-wrong, etc.), a 
biserial correlation is often used. If the criterion is also dichotomous (e.g., 
neurotic vs. normal, successful vs. unsuccessful employees), still other kinds 
of correlation may be used, such as tetrachoric or phi coefficients. Even when 
the criterion is continuous, item analysis is sometimes based on a comparison 
of extreme criterion groups. For example, students falling in the upper and 
lower thirds of their class in grade-point average may constitute the criterion 
groups. Whatever the specific procedure for computing item-criterion 
correlations, those items showing the closest correspondence with the criterion 
are retained and the other items are discarded in the preparation of the 
final test. 
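
When both the item and the criterion are dichotomous, the phi coefficient can be computed directly from the fourfold table of frequencies. The Python sketch below shows one way to do so; the function name and the counts in the usage lines are hypothetical and serve only to illustrate the computation.

```python
import math


def phi_coefficient(pass_upper: int, fail_upper: int,
                    pass_lower: int, fail_lower: int) -> float:
    """Phi coefficient between item response and criterion group membership.

    Arguments are the four cell counts of a 2 x 2 table: item passed or failed,
    within the upper and lower criterion groups.
    """
    a, b, c, d = pass_upper, fail_upper, pass_lower, fail_lower
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        return 0.0  # a row or column is empty, so the item cannot discriminate
    return (a * d - b * c) / denominator


# Hypothetical counts for two items, 30 persons in each criterion group.
print(f"item 1: phi = {phi_coefficient(20, 10, 10, 20):.2f}")
print(f"item 2: phi = {phi_coefficient(15, 15, 15, 15):.2f}")
```

An item passed equally often in both criterion groups yields a phi of zero and would be a candidate for elimination.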

Item analysis is frequently conducted against total score on the test itself. 
As was noted in Chapter 6, this procedure yields a measure of internal 
consistency, not external validity. Under certain conditions, the two approaches 
may lead to opposite results, the items chosen on the basis of external validity 
being the very ones rejected on the basis of internal consistency. Let us 
suppose that the preliminary form of a scholastic aptitude test consists of 100 
arithmetic items and 50 vocabulary items. In order to select items from this 
initial pool by the method of internal consistency, the biserial correlation 
between performance on each item and total score on the 150 items may be 
computed. It is apparent that such biserial correlations would tend to be 
higher for the arithmetic than for the vocabulary items, since the total score is 
based on twice as many arithmetic items. If it is desired to retain the 75 
“best” items in the final form of the test, it is likely that most of these items 
will prove to be arithmetic problems. In terms of the criterion of scholastic 
achievement, however, the vocabulary items might have been more valid 
predictors than the arithmetic items. If such is the case, the item analysis will 
have served to lower rather than raise the validity of the test. 

The practice of rejecting items that have low correlations with total score 
provides a means of purifying or homogenizing the test. By such a procedure, 
the items with the highest average intercorrelations will be retained. This 
method of selecting items will increase test validity only when the original 
pool of items measures a single trait and when this trait is present in the 
criterion. Most tests developed in actual practice, however, measure a com- 
bination of traits required by a complex criterion. Purifying the test in such 
a case may reduce its criterion coverage and thus lower validity. 

Probably the best way to reconcile these objectives is to sort the relatively 
homogeneous items into separate tests, or subtests, each of which will cover 
a different aspect of the criterion. Thus breadth of coverage is achieved 
through a variety of tests, each yielding a relatively unambiguous score, 
rather than through heterogeneity of items within a single test. By such a 
procedure, items with low indices of internal consistency would not be dis- 
carded, but would be segregated. Within each subtest or item group, fairly 
high internal consistency could thus be attained. At the same time, internal 
consistency would not be accepted as a substitute for item validity, and some 
attention would be given to adequacy of coverage and to the avoidance of ex- 
cessive concentration of items in certain areas. 

After the items have been selected, the validity of the final form of the 
test must be checked in a new sample. Such an independent determination of 
test validity is known as cross-validation (cf. 25). Any validity coefficient 
computed on the same sample that was used for item-selection purposes will 
capitalize on chance errors within that particular sample and will conse- 
quently be spuriously high. In fact, a high validity coefficient could result 
under such circumstances, even when the test has no validity at all in pre- 
dicting the particular criterion. 

Let us suppose that out of a sample of 100 medical students, the 30 with 
the highest and the 30 with the lowest medical school grades have been 
chosen to represent contrasted criterion groups. If, now, these two groups are 
compared in a number of traits actually irrelevant to success in medical 
school, certain chance differences will undoubtedly be found. Thus there 
might be an excess of urban-born and of red-haired persons within the upper 
criterion group. If we were to assign each individual a “score” by crediting 
him with one point for urban residence and one point for red hair, the mean 
of such scores would undoubtedly be higher in the upper than in the lower 
criterion group. This is not evidence for the validity of the predictors, 
however, since such a validation process is based upon a circular argument. The 
two predictors were chosen in the first place on the basis of the chance 
variations that characterized this particular sample. And the same chance 
differences are operating to produce the mean differences in total score. When 
tested in another sample, however, the chance differences in frequency of 
urban residence and red hair are likely to disappear or be reversed. 
Consequently, the validity of the scores will collapse. 

A specific illustration of the need for cross-validation is provided by an in- 
vestigation conducted with the Rorschach inkblot test (21). In an attempt 
to determine whether the Rorschach could be of any help in selecting sales 
managers for life insurance agencies, this test was administered to 80 such 
managers. These managers had been carefully chosen from several hundred 
employed by eight life insurance companies, so as to represent an upper 
criterion group of 42 considered very satisfactory by their respective com- 
panies, and a lower criterion group of 38 considered unsatisfactory. The 80 
test records were studied by a Rorschach expert, who selected a set of 32 
signs, or response characteristics, occurring more frequently in one criterion 
group than in the other. Signs found more often in the upper criterion group 
were scored +1 if present and 0 if absent; those more common in the lower 
group were scored —1 or 0. Since there were 16 signs of each type, total 
scores could range theoretically from —16 to +16. 

When the scoring key based on these 32 signs was reapplied to the original 
group of 80 persons, 79 of the 80 were correctly classified as being in the 
upper or lower group. The correlation between test score and criterion would 
thus have been close to 1.00. However, when the test was cross-validated on a 
second comparable sample of 41 managers, 21 in the upper and 20 in the 
lower group, the validity coefficient dropped to a negligible .02. It was thus 
apparent that the key developed in the first sample had no validity for 
selecting such personnel. 

That such results can be obtained under pure chance conditions was viv- 
idly demonstrated by Cureton (10). The criterion to be predicted was the 
grade-point average of 29 students registered in a particular course. The 
“items” consisted of 85 tags, numbered from 1 to 85 on one side. To obtain 
a score for each subject, the 85 tags were thoroughly shaken in a container 
and dropped on the table. All tags that fell with numbered side up were 
recorded as indicating the presence of that particular item in the student's 
test performance. Twenty-nine throws of the 85 tags thus provided complete 
records for each student, showing the presence or absence of each item or 
response sign. An item analysis was then conducted, with each student's 
grade-point average as the criterion. On this basis, 24 "items" were selected 
out of the 85, 9 of which occurred more frequently among the students with 
higher grades, and 15 among those with lower grades. The former received 
a +1 weight, the latter —1. The sum of these item weights constituted the 
total score for each student. Despite the known chance derivation of these 
“test scores," their correlation with the grade criterion in the original group of 
29 students proved to be .82. Such a finding is similar to that obtained with 
the Rorschach scores in the previously cited study. In both instances, the ap- 
parent correspondence between test score and criterion resulted from the 
utilization of the same chance differences both in selecting items and in de- 
termining validity of total test scores. 
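
A demonstration of this kind is easy to repeat by simulation. The Python sketch below is a loose imitation of the tag-throwing procedure rather than a reconstruction of Cureton's exact method: the grade-point averages are randomly generated, the 24 items are chosen as those with the largest absolute item-criterion correlations, and "cross-validation" is represented by scoring a fresh throw of the tags with the same key. All of these choices, and all names in the code, are assumptions made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Product-moment correlation; returns 0.0 if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def throw_tags(rng, n_students, n_tags):
    """One record per student: 1 if a tag landed numbered side up, else 0."""
    return [[rng.randint(0, 1) for _ in range(n_tags)] for _ in range(n_students)]


def score(records, key):
    """Total score: sum of +1 / -1 weights for the keyed items that are present."""
    return [sum(weight * record[item] for item, weight in key.items())
            for record in records]


def chance_validation(seed=0, n_students=29, n_tags=85, n_keep=24):
    """Return (validity in the item-selection sample, validity on a fresh throw)."""
    rng = random.Random(seed)
    grades = [rng.gauss(2.5, 0.5) for _ in range(n_students)]  # made-up criterion
    original = throw_tags(rng, n_students, n_tags)

    # Item analysis: correlate every chance "item" with the criterion.
    item_r = {j: pearson([rec[j] for rec in original], grades)
              for j in range(n_tags)}
    kept = sorted(item_r, key=lambda j: abs(item_r[j]), reverse=True)[:n_keep]
    key = {j: (1 if item_r[j] > 0 else -1) for j in kept}

    in_sample = pearson(score(original, key), grades)
    fresh = throw_tags(rng, n_students, n_tags)  # new throws, same scoring key
    cross_validated = pearson(score(fresh, key), grades)
    return in_sample, cross_validated


in_sample, cross_validated = chance_validation(seed=0)
print(f"validity in item-selection sample: {in_sample:.2f}")
print(f"validity on a fresh throw:         {cross_validated:.2f}")
```

Running the function with different seeds shows the same pattern: a high coefficient in the sample used to build the key, and a coefficient scattered around zero when the key is applied to new chance data.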

The amount of "shrinkage" of a validity coefficient in cross-validation 
depends in part upon the size of the original item pool and the proportion of 
items retained. When the number of original items is large and the proportion 
retained is small, there is more opportunity to capitalize on chance 
differences and thus obtain a spuriously high validity coefficient. Furthermore, 
if items are chosen on the basis of previously formulated hypotheses, derived 
from psychological theory or from past experience with the criterion, validity 
shrinkage in cross-validation will be minimized. For example, if a particular 
hypothesis required that the answer “Yes” be more frequent among success- 
ful students, then the item would not be retained if a significantly larger num- 
ber of "Yes" answers were given by the unsuccessful students. The opposite 
"shotgun" approach would be illustrated by assembling a miscellaneous set 
of questions with little regard to their relevance to the criterion behavior,
and then retaining all items yielding significant positive or negative correla- 
tions with the criterion. Under the latter circumstances, we would expect 
much more shrinkage than under the former. 

Still another condition affecting amount of shrinkage in cross-validation is 
size of sample. Since spuriously high validity in the initial sample results 
from an accumulation of sampling errors, smaller groups (which yield larger 
sampling errors) will exhibit greater validity shrinkage. In summary, shrinkage
of test validity in cross-validation will be greatest when samples are small,
the initial item pool is large, the proportion of items retained is small, and
items are assembled without previously formulated rationale.


COMBINING INFORMATION FROM DIFFERENT TESTS 


For the prediction of practical criteria, not one but several tests are gen- 
erally required. Most criteria are complex, the criterion measure depending 
upon a number of different traits. A single test designed to measure such a 
criterion would thus have to be highly heterogeneous. It has already been 
pointed out, however, that a relatively homogeneous and factorially pure test
is more satisfactory because it yields less ambiguous scores. Hence it is usu- 
ally preferable to use a combination of several relatively homogeneous tests, 
each covering a different aspect of the criterion, rather than a single test 
consisting of a hodgepodge of many different kinds of items. 

When a number of specially selected tests are employed together to predict 
a single criterion, they are known as a test battery. The chief problem arising
in the use of such batteries concerns the way in which scores on the different 
tests are to be combined in arriving at a decision regarding each individual. 
The procedures followed for this purpose may be subsumed under three 
major headings: (a) multiple regression equation, (b) multiple cutoff scores,
and (c) clinical judgment.

Multiple Regression Equation. The multiple regression equation yields a
predicted criterion score for each individual on the basis of his scores on all
the tests in the battery. The following regression equation, taken from the
Manual of the Holzinger-Crowder Uni-Factor Tests (17, p. 22), illustrates
the application of this technique to predicting a student's achievement in
high school mathematics courses:


Mathematics Achievement = .21 V + .05 S + .21 N + .32 R + 1.05


In using this equation, the student's stanine scores on the verbal (V), spatial
(S), numerical (N), and reasoning (R) tests are multiplied by the corresponding
weights given in the equation. The sum of these products, plus a constant
(1.05), gives the student's predicted stanine position in mathematics courses.
Suppose that Bill Jones receives the following stanine scores:

Verbal        6
Spatial       5
Numerical     4
Reasoning     8


The estimated mathematics achievement of this student is found as follows:


Mathematics Achievement = (.21)(6) + (.05)(5) + (.21)(4) + (.32)(8) + 1.05 = 5.96


Bill's predicted stanine is approximately 6. It will be recalled (Ch. 4) that a
stanine of 5 represents average performance. Bill would thus be expected to
do somewhat better than average in mathematics courses. His very superior
performance in the reasoning tests (R = 8) and his above-average score on
the verbal tests (V = 6) compensate for his poor score in speed and accuracy
of computation (N = 4).
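
The arithmetic for Bill Jones can be checked with a few lines of code. The weights and the constant are those of the equation quoted above; the function and variable names are merely illustrative.

```python
# Weights and constant copied from the regression equation quoted in the text.
WEIGHTS = {"V": 0.21, "S": 0.05, "N": 0.21, "R": 0.32}
CONSTANT = 1.05


def predict_math_stanine(stanines):
    """Weighted sum of the four stanine scores plus the constant."""
    return sum(WEIGHTS[test] * stanines[test] for test in WEIGHTS) + CONSTANT


bill = {"V": 6, "S": 5, "N": 4, "R": 8}
print(round(predict_math_stanine(bill), 2))  # 5.96
```

The reasoning test alone contributes (.32)(8) = 2.56 points, more than twice the contribution of any other test, which is how a high R score offsets the low N score.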
Specific techniques for the computation of regression equations can be
found in many texts on psychological statistics (cf., e.g., 16, 34). Essentially,
such an equation is based upon the correlation of each test with the criterion, 
as well as upon the intercorrelations among the tests. Obviously, those tests 
that correlate higher with the criterion should receive more weight. It is 
equally important, however, to take into account the correlation of each test 
with the other tests in the battery. Tests correlating highly with each other 
represent needless duplication, since they cover to a large extent the same 
aspects of the criterion. The inclusion of two such tests will not appreciably 
increase the validity of the entire battery, even though both tests may cor- 
relate highly with the criterion. In such a case, one of the tests would serve 
about as effectively as the pair; only one would therefore be retained in the 
battery. 

Even after the most serious instances of duplication have been eliminated, 
however, the tests remaining in the battery will correlate with each other to 
varying degrees. For maximum predictive value, tests that make a more 
nearly unique contribution to the total battery should receive greater weight 
than those that partly duplicate the functions of other tests. In the computa- 
tion of a multiple regression equation, each test is weighted in direct pro- 
portion to its correlation with the criterion and in inverse proportion to its 
correlations with the other tests. Thus the highest weight will be assigned to 
the test with the highest validity and the least amount of overlap with the 
rest of the battery. 

The validity of the entire battery can be found by computing the multiple
correlation (R) between the criterion and the battery. This correlation
indicates the highest predictive value that can be obtained from the given
battery, when each test is given optimum weight for predicting the criterion in
question. The optimum weights are those determined by the regression equation.
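
For a battery of only two tests, the standard-score regression weights and the multiple correlation can be written directly in terms of the three correlations involved. The sketch below uses the usual two-predictor formulas from statistics texts; the correlation values are hypothetical and were chosen only to show how overlap between tests limits the gain from adding a second test.

```python
import math


def two_test_battery(r1c, r2c, r12):
    """Return (beta1, beta2, R) for a two-test battery.

    r1c, r2c: correlation of each test with the criterion
    r12:      correlation between the two tests
    """
    denom = 1.0 - r12 ** 2
    beta1 = (r1c - r2c * r12) / denom
    beta2 = (r2c - r1c * r12) / denom
    multiple_r = math.sqrt(beta1 * r1c + beta2 * r2c)
    return beta1, beta2, multiple_r


# Two tests, each with a validity of .50 (hypothetical values).
for r12 in (0.90, 0.00):
    b1, b2, big_r = two_test_battery(0.50, 0.50, r12)
    print(f"r12 = {r12:.2f}: weights {b1:.2f}, {b2:.2f}; R = {big_r:.2f}")
```

When the two tests correlate .90 with each other, the battery is scarcely more valid than either test alone; when they are uncorrelated, the same two validities combine into a considerably higher multiple correlation.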


It should be noted that these weights are optimum only for the particular 
sample in which they were found. Because of chance errors in the correlation 
coefficients used in deriving them, the regression weights may vary from 
sample to sample. Hence the battery should be cross-validated by correlating 
the predicted criterion scores with the actual criterion scores in a new sample. 
Formulas are available for estimating the amount of shrinkage in a multiple
correlation to be expected when the regression equation is applied to a second
sample, but empirical verification is preferable whenever possible. As in the
previously discussed case of item analysis, the shrinkage will be smaller, the
larger the sample on which regression weights were derived.

When validity of the battery is redetermined in a different school, factory,
or other population from that on which the regression equation was derived, it
is likely that the criterion will also differ. Academic success in two colleges, 
for example, may require a somewhat different constellation of aptitudes. Re- 
computing the validity of the battery under these circumstances is known as 
validity generalization (25). Through such a process, the test user will be 
able to detect not only the drop in validity attributable to chance errors, as in 
cross-validation, but also the drop due to changes in criterion definition. 
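
As an illustration of the kind of shrinkage formula mentioned earlier in connection with cross-validation, the sketch below applies the familiar correction for the number of cases and the number of predictors (the "adjusted R squared" form found in statistics texts). The text does not say which formula is intended, so this choice is an assumption, and the numerical values are hypothetical.

```python
import math


def shrunken_multiple_r(multiple_r, n_cases, n_tests):
    """Estimate the multiple correlation after shrinkage.

    Uses 1 - (1 - R^2)(N - 1)/(N - k - 1), one common correction;
    other variants exist and give slightly different values.
    """
    r_squared = multiple_r ** 2
    adjusted = 1.0 - (1.0 - r_squared) * (n_cases - 1) / (n_cases - n_tests - 1)
    return math.sqrt(max(adjusted, 0.0))  # a negative estimate is reported as zero


# A four-test battery with an obtained R of .60 (hypothetical values).
for n in (50, 500):
    print(n, round(shrunken_multiple_r(0.60, n, 4), 3))
```

The estimate drops further below .60 in the sample of 50 than in the sample of 500, in line with the statement that shrinkage is smaller when regression weights are derived on larger samples. Such an estimate covers chance errors only; it says nothing about changes in the criterion from one population to another.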

Multiple Cutoff Scores. An alternative strategy for combining test scores 
utilizes multiple cutoff points. Briefly, this procedure involves the establish- 
ment of a minimum cutoff score on each test. Every individual who falls be- 
low such a minimum score on any one of the tests is rejected. Only those 
persons who reach or exceed the cutoff scores in all tests are accepted. An
example of this technique is provided by the General Aptitude Test Battery
(GATB) developed by the United States Employment Service for use in
the occupational counseling program of its State Employment Service
offices (31). Of the nine aptitude scores yielded by this battery, those to be
considered for each occupation were chosen on the basis of criterion
correlations as well as means and standard deviations of workers in that
occupation.


The development of GATB occupational standards for machine cutters in
the food-canning and preserving industry is illustrated in Table 12. In terms
of standard scores with a mean of 100 and an SD of 20, the cutoff scores for
this occupation were set at 75 in Motor Coordination (K), Finger Dexterity
(F), and Manual Dexterity (M). Table 12 gives mean, standard deviation,
and correlation with the criterion (supervisory ratings) for each of the nine
scores in a group of 57 women workers. On the basis of criterion correlations,
Manual Dexterity and Motor Coordination appeared promising. Finger Dexterity
was added because it yielded the highest mean score in the battery,
even though individual differences within the group were not significantly
correlated with criterion ratings. It would seem that women who enter or remain
in this type of job are already selected with regard to Finger Dexterity.¹


TABLE 12. Illustrative Data Used to Establish Cutoff Scores on GATB

(From 32, p. 10)

[Table 12 lists the mean, SD, and criterion correlation of each of the nine
GATB aptitude scores (G Intelligence, V Verbal, N Numerical, S Spatial,
P Form Perception, Q Clerical Perception, K Motor Coordination, F Finger
Dexterity, M Manual Dexterity), with correlations marked as significant at
the .01 or .05 level.]
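
The decision rule itself is simple to state in code. In the sketch below the three cutoffs of 75 are those given in the text for this occupation; the two applicants and their scores are hypothetical.

```python
# Cutoff scores from the text: 75 on K, F, and M (standard scores, mean 100, SD 20).
CUTOFFS = {"K": 75, "F": 75, "M": 75}


def qualifies(scores, cutoffs=CUTOFFS):
    """True only if every cutoff is reached or exceeded."""
    return all(scores[aptitude] >= minimum for aptitude, minimum in cutoffs.items())


# Hypothetical applicants; only K, F, and M enter into the decision.
applicants = {
    "Applicant 1": {"K": 80, "F": 75, "M": 110},
    "Applicant 2": {"K": 130, "F": 74, "M": 120},
}
for name, scores in applicants.items():
    print(name, "qualifies" if qualifies(scores) else "does not qualify")
```

The second applicant is rejected despite very high scores on K and M, since a score below any one cutoff is sufficient for rejection.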

The validity of the composite KFM pattern of cutting scores in a group of
194 workers is shown in Table 13. It will be seen that, of 150 good workers,
120 fell above the cutting scores in the three aptitudes and 30 were false
positives, falling below one or more cutoffs. Of the 44 poor workers, 30 were
correctly identified (hits) and 14 were not (misses). The over-all efficacy of
this cutoff pattern is indicated by a tetrachoric correlation of .70 between
predicted status and criterion ratings.


TABLE 13. Effectiveness of GATB Cutoff Scores on Aptitudes K, F, and M in
Identifying Good and Poor Workers

(From 32, p. 14)

                      Aptitude Pattern
Criterion
Rating       Non-Qualifying    Qualifying    Total
Good               30              120         150
Poor               30               14          44
Total              60              134         194
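
The figures in Table 13 can be summarized numerically. The sketch below computes the proportion of workers correctly classified and approximates the tetrachoric correlation with the "cosine-pi" formula. The source does not state how the reported coefficient was computed, so the use of this approximation is an assumption; it happens to agree closely with the published value.

```python
import math

# Cell frequencies from Table 13.
good_qualifying, good_non_qualifying = 120, 30
poor_qualifying, poor_non_qualifying = 14, 30

total = (good_qualifying + good_non_qualifying
         + poor_qualifying + poor_non_qualifying)
proportion_correct = (good_qualifying + poor_non_qualifying) / total


def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation; a and d are the two 'agreement' cells."""
    return math.cos(math.pi / (1.0 + math.sqrt((a * d) / (b * c))))


r_tet = tetrachoric_cos_pi(good_qualifying, good_non_qualifying,
                           poor_qualifying, poor_non_qualifying)
print(f"correct decisions: {proportion_correct:.2f}")
print(f"approximate tetrachoric r: {r_tet:.2f}")
```

About 77 per cent of the 194 workers fall in the two agreement cells, and the approximation gives a coefficient of roughly .70.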

If only scores yielding significant validity coefficients are taken into ac- 
count, one or more essential abilities in which all workers in the occupation 
excel might be overlooked. Hence the need for considering also those apti- 
tudes in which workers excel as a group, even when individual differences 
beyond a certain minimum are unrelated to degree of job success. The mul- 
tiple cutoff method is preferable to the regression equation in situations such 
as these, in which test scores are not linearly related to the criterion. For ex- 
ample, up to a point, increasing speed of hand movements may lead to 
greater output in an assembly-line job. But beyond that point, greater speed 
may be of no avail because of the mechanical limitations of the operation. 
In some jobs, moreover, workers may be so homogeneous in a key trait that 
the range of individual differences is too narrow to yield a significant
correlation between test scores and criterion.

The strongest argument for the use of multiple cutoffs rather than a re- 
gression equation centers around the question of compensatory qualifications. 


1 The data have been somewhat simplified for illustrative purposes. Actually, the final choice 
of aptitudes and cutting scores was based on separate analyses of three groups of workers on 
related jobs, on the results obtained in a combined sample of 194 cases, and on qualitative job 
analyses of the operations involved. 
With the regression equation, an individual who rates low in one test may
receive an acceptable total score because he rates very high in some other
test in the battery. A marked deficiency in one skill may thus be compensated
for by outstanding ability along other lines. It is possible, however, that
certain types of activity may require essential skills for which there is no
substitute. In such cases, individuals falling below the required minimum
in the essential skill will fail, regardless of their other abilities. An opera
singer, for example, cannot have poor pitch discrimination, regardless of how
well he meets the other requirements of such a career. Similarly, operators
of sound-detection devices in submarines need good auditory discrimination.
Those men incapable of making the necessary discriminations cannot succeed
in such an assignment, regardless of superior mechanical aptitude, general
intelligence, or other traits in which they may excel. With a multiple cutoff
strategy, individuals lacking any essential skill would always be rejected,
while with a regression equation they might be accepted.

When the relation between tests and criterion is linear and additive, on the 
other hand, a higher proportion of correct decisions will be reached with a
regression equation than with multiple cutoffs. Another important advantage
of the regression equation is that it provides an estimate of each person's
criterion score, thereby permitting the relative evaluation of all individuals.
With multiple cutoffs, no further differentiation is possible among those
accepted or among those rejected. In many situations, the best strategy may
involve a combination of both procedures. Thus the multiple cutoff may be
applied first, in order to reject those falling below minimum standards on
any test, and predicted criterion scores may then be computed for the
remaining acceptable cases by the use of a regression equation. If enough is
known about the particular job requirements, the preliminary screening may 
be done in terms of only one or two essential skills, prior to the application of 
the regression equation. 
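
The combined strategy can be sketched as a two-step routine: screen on the essential skill, then rank the survivors by predicted criterion score. All test names, weights, cutoffs, and applicants below are hypothetical and serve only to show the logic.

```python
def screen_then_rank(applicants, cutoffs, weights, constant=0.0):
    """Reject anyone below a cutoff, then rank the rest by predicted score."""
    accepted = {
        name: scores for name, scores in applicants.items()
        if all(scores[test] >= minimum for test, minimum in cutoffs.items())
    }
    predicted = {
        name: constant + sum(w * scores[test] for test, w in weights.items())
        for name, scores in accepted.items()
    }
    return sorted(predicted.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical stanine scores on two tests; test B measures the essential skill.
applicants = {
    "Adams": {"A": 9, "B": 2},
    "Baker": {"A": 5, "B": 5},
    "Clark": {"A": 6, "B": 7},
}
ranking = screen_then_rank(applicants, cutoffs={"B": 4},
                           weights={"A": 0.5, "B": 0.5})
print(ranking)  # [('Clark', 6.5), ('Baker', 5.0)]
```

With the regression weights alone, Adams would receive a predicted score of 5.5 and would outrank Baker, his very high score on test A compensating for the deficiency on test B. The preliminary cutoff removes him before the regression equation is applied.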

Clinical Judgment. When tests are employed in the intensive study of individual
cases, as in clinical diagnosis, counseling, or the selection of high-level
personnel, it is a common practice for scores on separate tests to be utilized
by the examiner in arriving at a decision without further statistical
manipulation. To be sure, the individual's scores are interpreted with reference
to any available general or local norms; but no statistical formula or other
automatic procedure is applied in combining scores from different tests or in
evaluating the individual's score pattern. Through a relatively subjective
process, the examiner interprets the individual's scores in terms of his own
past experience with similar cases, his familiarity with particular job
requirements, or his knowledge of psychological theory and relevant published
research. The result may be presented in the form of a detailed description of
personality dynamics, a specific prediction (e.g., Mr. Brown will
make a good executive vice-president for this company; Miss Peterson will not 
respond well to psychotherapy), or both. 

In a thought-provoking book entitled Clinical Versus Statistical Prediction 
(22), Meehl discussed the process of clinical judgment and surveyed some 
20 investigations comparing the two types of prediction. The criteria pre- 
dicted in these studies included principally success in some kind of schooling 
or training (college, Air Force pilot training, etc.), response to therapy on 
the part of psychotic or neurotic patients, and criminal recidivism as well as 
institutional adjustment of reformatory inmates. Predictions were made by 
clinical psychologists, counselors, psychiatrists, and other professional persons 
with varying amounts of experience in the use of clinical procedures. Focus- 
ing only on the process of combining data, rather than on differences in the 
kind of data obtained, Meehl showed that, with only one questionable ex- 
ception, the routine application of statistical procedures yielded at least as 
many correct predictions as clinical analysis, and frequently more. In this 
connection, Meehl also called attention to the much greater cost of clinical 
predictions, in terms of both time and level of personnel required. Once a re- 
gression equation or other statistical strategy has been developed, it can be 
applied by a clerk or even a machine. 

Although the data are admittedly meager and more research is needed on 
this question, the consistency of results in Meehl's survey strongly suggests 
that when statistical formulas of known validity are available for combining 
test scores, they should be used in preference to subjective clinical judgment. 
In the light of these findings, it would seem that the clinician's chief con- 
tributions to diagnosis and prediction are in areas in which satisfactory tests 
are unavailable. Systematic interviewing, case histories, and direct observa- 
tion of behavior are still the principal sources of information on many as- 
pects of personality. Clinical methods also lend themselves better than tests 
to the evaluation of rare and idiosyncratic events which occur too infre- 
quently to permit the establishment of statistical strategies. Similarly, the 
clinician can give due consideration to the context in which events occur. 
For example, the same physical disability may have very different effects 
on the personality development of two children because of other concomitant 
traits or circumstances. To be sure, this is a problem of pattern analysis,
which theoretically can be handled by appropriate statistical procedures; but 
when the modifying variables are numerous and each occurs infrequently, 
the statistical procedures would become too complex to be practicable. 

On the other hand, it should be noted that in a few of the studies reported 
by Meehl, when counselors and clinicians had access to more data than were
used in the statistical predictions, the clinical predictions were still no more
accurate than the statistical. For example, counselors in a large state
university predicted academic success of entering freshmen on the basis of high
school percentile rank, an intelligence test, several aptitude and achievement
tests, the Strong Vocational Interest Blank, a personality inventory, an
individual record form, and interview notes (27). Statistical prediction was
carried out by a clerk using a regression equation developed on previous
classes in the same university. The only predictor variables in this equation
were high school percentile ranks and the intelligence test scores. The validity
coefficients of these statistical predictions against actual outcome were .45
and .70 for men and women students, respectively; clinical predictions
yielded corresponding validities of .35 and .69.

It is apparent that the validity of clinical predictions against actual
outcomes should be systematically investigated whenever feasible. More data
are also needed on the consistency of predictions about the same subjects
made by different clinicians and by the same clinicians at different times. In
so far as possible, the process and cues on which clinical predictions are based
should be made explicit in clinical records. Such a practice would not only
facilitate research and training, but would also serve to encourage reliance
on sound data and defensible interpretations. Finally, the "clinician as
instrument" is an important concept in this connection. Undoubtedly the
objectivity and skill with which data are gathered and interpreted, and the
resulting accuracy of predictions, vary widely with the abilities, personality,
professional training, and experience of individual clinicians.


USE OF TESTS FOR CLASSIFICATION DECISIONS 


Psychological tests may be used for purposes of selection, placement, or
classification. In selection, each individual is either accepted or rejected.
Deciding whether or not to admit a student to college, to hire a job applicant,
or to accept an Army recruit for officer training are examples of selection
decisions. When selection is done sequentially, the earlier stages are often
called “screening,” the term “selection” being reserved for the more intensive
final stages. “Screening” may also be used to designate any rapid, rough
selection process even when not followed by further selection procedures.

Both placement and classification differ from selection in that no one is
rejected, or eliminated from the program. All individuals are assigned to
appropriate “treatments” so as to maximize the effectiveness of outcomes.
In placement, the assignments are based on a single score. This score may
be derived from a single test, such as an intelligence test. If a battery of
tests has been administered, a composite score computed from a single re- 
gression equation would be employed. Examples of placement decisions in- 
clude the sectioning of college freshmen into different mathematics classes 
on the basis of their scores on mathematical aptitude tests, assigning appli- 
cants to clerical jobs requiring different levels of skill and responsibility, and 
placing psychiatric patients into "more disturbed" and “less disturbed" 
wards. It is evident that in each of these decisions only one criterion is
employed and that placement is determined by the individual's position along a
single predictor scale.

Classification, on the other hand, always involves two or more criteria. In 
a military situation, for example, classification is a major problem, since 
each man in an available manpower pool must be assigned to the military 
specialty where he can serve most effectively. Classification decisions are 
likewise required in industry, when new employees are assigned to training 
programs for different kinds of jobs. Other examples include the assignment 
of students to different curricula in college (science, liberal arts, etc.), as well 
as the choice of a field of concentration by the student. Counseling is based 
essentially on classification, since the client is told his chances of succeeding 
in different kinds of work. Clinical diagnosis is likewise a classification
problem, the major purpose of each diagnosis being a decision regarding the
most appropriate type of therapy.

Although placement can be done with either one or more predictors, classi- 
fication requires multiple predictors whose validity is individually deter- 
mined against each criterion. A classification battery requires a different re- 
gression equation for each criterion. Some of the tests may have weights in 
all the equations, although of different values; others may be included in 
only one or two equations, having zero or negligible weights in terms of some 
of the criteria. Thus the combination of tests employed out of the total 
battery, as well as the specific weights, differs with the particular criterion. An 
example of such a classification battery is that developed by the Air Force for 
assignment of personnel to different training programs (11). This battery,
consisting of both paper-and-pencil and apparatus tests, provided stanine 
scores for pilots, navigators, bombardiers, and a few other air-crew special- 
ties. By finding an individual’s estimated criterion scores from the different 
regression equations, it was possible to predict whether, for example, he was 
better qualified as a pilot than as a navigator. 
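
The logic of a classification battery, with one regression equation per criterion, can be sketched as follows. The test names, weights, and constants are entirely hypothetical; they are not the Air Force values and are meant only to show how the same set of scores yields a different prediction for each specialty.

```python
# Hypothetical regression equations: (weights by test, constant) per specialty.
# A zero weight means the test does not enter that equation.
EQUATIONS = {
    "pilot": ({"spatial": 0.4, "coordination": 0.5, "numerical": 0.0}, 0.5),
    "navigator": ({"spatial": 0.2, "coordination": 0.0, "numerical": 0.7}, 0.5),
}


def predicted_scores(stanines):
    """Predicted criterion score for every specialty."""
    return {
        specialty: constant + sum(w * stanines[test] for test, w in weights.items())
        for specialty, (weights, constant) in EQUATIONS.items()
    }


def best_assignment(stanines):
    """Specialty with the highest predicted criterion score."""
    scores = predicted_scores(stanines)
    return max(scores, key=scores.get)


candidate = {"spatial": 6, "coordination": 8, "numerical": 4}
print(predicted_scores(candidate))
print(best_assignment(candidate))
```

For this candidate the predicted score is about 6.9 as a pilot and about 4.5 as a navigator, so he would be assigned to pilot training. A man with the opposite pattern of coordination and numerical scores would be assigned to navigator training by the same two equations.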

Such differential prediction of criteria with a battery of tests permits a
fuller utilization of available human resources than is possible with a single
general test or with a composite score from a single regression equation.
This is illustrated in Figures 26 and 27, which are based on the Aptitude
Area scores employed by the Army (30, 37, 38). Each Aptitude Area corre- 
sponds to a group of Army jobs requiring common qualifications. Out of a 
total eleven-test battery, only two tests are used to determine the individual's 
score in each Aptitude Area. Figure 26 shows the percentages of men in a
large sample who obtained standard scores of 100 or higher on the AGCT and
in their best Aptitude Area, respectively. As would be expected, a considerably
larger proportion of men (75%) reach or exceed a given minimum level of
performance in their best Aptitude Area than on a general aptitude test such
as the AGCT (53%). More effective utilization of individual talents can
thus be achieved by means of a differential aptitude battery than through the
use of varying cutoff scores on a "general intelligence" test.


Fig. 26. Percentages of Army Recruits with Standard Scores of 100 or Higher on
AGCT and on Aptitude Areas. (From Willemin and Karcher, 37, p. 1.)


The same point is differently illustrated in Figure 27. Both graphs in this
figure show the distribution of AGCT scores of the men remaining for
non-priority assignments after the top men have been “creamed off” for priority
jobs. In the upper graph, AGCT is used to cream off men for priority positions.
In other words, men with the highest AGCT scores were assigned to priority
jobs. Non-priority openings were then filled from those remaining. In the lower
graph, Aptitude Area scores are used, those men being assigned to priority
jobs who had the highest scores in the special aptitudes required by each
job. For example, if a priority job required chiefly mathematical aptitude,
men highest in this aptitude would be assigned to that post, while those high
in other aptitudes, such as spatial, motor, or verbal, were left for non-priority
assignments that might call for just those aptitudes. It is apparent that a
more capable group of men is left for non-priority assignments when priority
jobs are filled on the basis of specific rather than general qualifications. In
other words, by utilizing each individual's special talents, it is possible to
fill the priority jobs with the men who are best qualified for those jobs, and
at the same time have men with good qualifications remaining for other,
non-priority assignments.


Fig. 27. AGCT Scores of Men Remaining for Non-Priority Jobs When Priority Jobs
are Filled by AGCT Scores and by Aptitude Area Scores. A. When AGCT is Used to
Cream Off Men for Priority Jobs. B. When Aptitude Areas are Used to Cream Off
Men for Priority Jobs. (Adapted from Willemin and Karcher, 37, pp. 2 and 3.)


The differential validity of a classification test depends upon the differ- 
ence between its correlations with the separate criteria to be predicted. In a 
two-criterion classification problem, for example, the ideal test would have
a high correlation with one criterion and a zero correlation (or preferably a
negative correlation) with the other criterion. General intelligence tests are
relatively poor for classification purposes, since they predict success about
equally well in most areas. Hence their correlations with the criteria to be
differentiated would be too similar. An individual scoring high on such a test
would be classified as successful for either assignment, and it would be
impossible to predict in which he would do better. In a classification battery,
we need some tests that are good predictors of criterion A and poor predictors
of criterion B, and other tests that are poor predictors of A and good
predictors of B. Statistical procedures have been developed for selecting
tests so as to maximize the differential validity of a classification battery
(5, 9, 18, 24). When the number of criteria is greater than two, such pro- 
cedures become quite complex. 

An alternative way of handling classification decisions is by means of the 
multiple discriminant function (cf. 14, 20). Essentially, this is a
mathematical procedure for determining how closely the individual's scores on a
whole set of tests approximate the scores typical of persons in a given
occupation, curriculum, psychiatric syndrome, or other category. A person would
then be assigned to the particular group he resembles most closely. While the
regression equation permits the prediction of degree of success in each field,
the multiple discriminant function treats all persons in one category as of
equal status. Group membership is the only criterion data utilized by this
method. The discriminant function is useful when criterion scores are un- 
available and only group membership can be ascertained. Some tests, for 
instance, are validated by administering them to persons in different occupa- 
tions, although no measure of degree of vocational success is available for 
individuals within each field. 
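
The idea of assigning a person to the group he most resembles can be conveyed by a much simpler rule than the multiple discriminant function proper: assignment to the group whose mean scores lie closest to his own. The sketch below uses this nearest-mean rule as a deliberately simplified stand-in; a true discriminant analysis would also take account of the variability and intercorrelation of the tests within groups. The occupations, tests, and mean scores are hypothetical.

```python
import math

# Hypothetical mean stanines (dominance, verbal) of persons in three occupations.
GROUP_MEANS = {
    "bookkeeper": (3.0, 5.0),
    "salesman": (6.0, 6.0),
    "supervisor": (9.0, 5.0),
}


def closest_group(scores, group_means=GROUP_MEANS):
    """Group whose mean profile is nearest (Euclidean distance) to the scores."""
    return min(group_means, key=lambda g: math.dist(scores, group_means[g]))


print(closest_group((6.5, 5.5)))
print(closest_group((9.0, 6.0)))
```

Only group membership enters into this rule; no measure of degree of success within an occupation is required. Note also that a person with the highest dominance score is not assigned to the salesman group, whose typical score on this trait is intermediate.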

The discriminant function is also appropriate when there is a non-linear 
relation between the criterion and one or more predictors. For example, in
certain personality traits there may be an optimum range for a given
occupation. Individuals having either more or less of the trait in question would
thus be at a disadvantage. It seems reasonable to expect, for instance, that
salesmen showing a moderately high amount of social dominance would be
most likely to succeed, and that the chances of success would decline as
scores move in either direction from this region. With the discriminant
function, we would tend to select individuals falling within this optimum range.
With the regression equation, on the other hand, the more dominant the
score, the more favorable would be the predicted outcome. If the correlation
between predictor and criterion were negative, of course, the regression
equation would yield more favorable predictions for the low scorers. But
there is no way whereby an intermediate score would receive maximum 
credit. Although in many instances the two techniques would lead to the 
same choices, there are situations in which persons would be differently classi- 
fied by regression equations and discriminant functions. For most psychologi- 
cal testing purposes, regression equations provide a more effective technique. 
Under certain circumstances, however, the discriminant function is better 
suited to yield the required information. 
PART 2


General Intelligence Tests 


CHAPTER 8 


Stanford-Binet Intelligence Scale 


In Part 1, we were concerned with the major principles of psychological
testing. We are now ready to apply these principles to the evaluation of
specific tests. We now know what questions to ask about each test and where
to look for the answers. The test manuals, the Mental Measurements
Yearbooks, special journals and handbooks, and other sources described in
Chapter 2 may be consulted to obtain information regarding any of the
tests cited.

The purpose of the remaining parts of the book is twofold. One objective
is to afford an opportunity to observe the application of testing principles
to a wide variety of tests. Another is to acquaint the reader with a few
outstanding tests in each of the major areas. No attempt will be made to
provide a comprehensive survey of available tests within any area. Such a
survey would be outside the scope of this book. Moreover, it would probably
be outdated before publication, because of the rapidity with which new
tests appear.


For these reasons, the discussion will concentrate upon a few representative
tests in each category, chosen either because of their widespread use or
because they illustrate important developments in testing procedure.
Following the categories outlined in Chapter 2, we shall consider general
intelligence tests in Part 2, differential testing of abilities in Part 3,
and measurement of personality characteristics in Part 4.

It will be recalled that general intelligence tests are designed for use in a
wide variety of situations and are validated against relatively broad criteria.
They characteristically provide a single score, such as an IQ, indicating the
individual's general intellectual level. An effort is made to arrive at such an
over-all estimate of intellectual performance by "the sinking of shafts at
critical points" (21, p. 4). In other words, a wide variety of tasks is presented
to the subject in the expectation that an adequate sampling of all important
intellectual functions will thus be covered. In actual practice, the tests are
usually overloaded with certain functions, such as verbal ability, and
completely omit others.

Since so many intelligence tests are validated against measures of academic 
achievement, they are often designated as tests of scholastic aptitude. Intelli- 
gence tests are frequently employed as preliminary screening instruments, to 
be followed by tests of special aptitudes. Such a practice is especially prev- 
alent in the testing of normal adolescents or adults for counseling, personnel 
selection, and similar purposes. Another common use of general intelli- 
gence tests is to be found in clinical testing, especially in the identification 
and classification of mental defectives. For clinical purposes, individual tests 
such as the Stanford-Binet or Wechsler scales are generally employed. 

This chapter will be concerned with the Stanford-Binet Intelligence Scale. 
Chapter 9 will consider the principal types of group tests available for differ- 
ent ages and educational levels. Non-language and performance scales will 
be treated in Chapter 10, and tests designed for the preschool and infant 
levels in Chapter 11. Chapter 12 will be devoted to the Wechsler scales, 
including both the adult form and the form for children. While they are used 
for many of the same purposes as the Stanford-Binet, the Wechsler scales 
include performance as well as verbal tests. Moreover, although adminis- 
tered as individual scales, they share many technical features with group 
tests. Chronologically, they represent a relatively recent major development 
in intelligence testing. For all these reasons, the Wechsler scales can be most 
effectively considered in the last chapter in Part 2, following a discussion of
other types of intelligence tests. Chapter 12 will also touch upon the clinical
use of tests in detecting intellectual impairment associated with brain damage
and psychotic deterioration.


DEVELOPMENT OF STANFORD-BINET SCALES 


The original Binet-Simon Scales have already been described briefly in
Chapter 1. It will be recalled that the 1905 Scale consisted simply of 30
short tests, arranged in ascending order of difficulty. The 1908 Scale was the 
first age scale; and the 1911 Scale introduced minor improvements and 
additions. The age range covered by the 1911 revision extended from 3 
years to the adult level. Among the many translations and adaptations of 
the early Binet tests were a number of American revisions. The earliest were 
prepared by Goddard (5, 6, 7), who was at that time research director at 
the Vineland, New Jersey, Training School for mental defectives. The God- 
dard scales were essentially translations of the various Binet scales, with 
minor changes necessary to adapt the content to American children. Other 
early revisions were developed by Kuhlmann (13), who extended the scales
downward to the age of 3 months, by Yerkes, Bridges, and Hardwick (24),
and by Herring (8). An account of these early American revisions, as well
as a detailed description of the original Binet-Simon Scales, can be found in
Peterson (17). 

The first Stanford revision of the Binet-Simon Scales, prepared by Terman 
and his associates at Stanford University, was published in 1916 (20). This 
revision introduced so many changes and additions as to represent virtually 
a new test. Over one-third of the items were new, and a number of old items 
were revised, reallocated to different age levels, or discarded. The entire 
scale was restandardized on an American sample of approximately one
thousand children and four hundred adults. Detailed instructions for
administering and scoring each test¹ were provided, and the IQ was employed for
the first time in any psychological test. 

The second Stanford revision, appearing in 1937, consisted of two 
equivalent forms, L and M (21). In this revision, the scale was greatly ex- 
panded and completely restandardized on a new and carefully chosen sample 
of the American population. A third revision, published in 1960, provided 
a single form (L-M) in which were incorporated the best items from the two
1937 forms (22). Without introducing any new content, it was thus possible
to eliminate obsolescent items and to relocate items whose difficulty level
had altered during the intervening years owing to cultural changes.

In preparing the 1960 Stanford-Binet, the authors were faced with a
common dilemma of psychological testing. On the one hand, frequent
revisions are desirable in order to profit from technical advances and
refinements in test construction and from prior experience in the use of the
test, as well as to keep the test content up to date. The last-named
consideration is especially important for information items and for pictorial
material which may be affected by changing fashions in dress, household
appliances, cars, and other common articles. The use of obsolete test content
may seriously undermine rapport and may alter the difficulty level of items.
On the other hand, revision may render much of the accumulated data
inapplicable to the new form. Tests that have been widely used for many years
have acquired a rich body of interpretive material which should be carefully
weighed against the need for revision. It was for these reasons that the
authors of the Stanford-Binet chose to condense the two earlier forms into one,
thereby steering a course between the twin hazards of obsolescence and
discontinuity. The loss of a parallel form was not too great a price to pay for
accomplishing this purpose. As the authors point out, by 1960 there was less
need for an alternate form than there had been in 1937 when no other
well-constructed individual intelligence scale was available.

1 The items in the Binet scales are commonly called "tests," since each is
separately administered and may contain several parts.

Because the 1960 revision is made up entirely of items from the 1937
forms, any evaluation of the 1960 Stanford-Binet requires an examination of 
the procedures followed in developing the earlier forms (cf. 14; 21, Ch. 2). 
The construction of the 1937 Stanford-Binet required nearly ten years of 
research. Much preliminary work went into the assembling of promising 
new items and their tryout on small groups of children who had taken the 
earlier form of the test. For every item, curves were plotted showing the per- 
centage of children at each MA level on the 1916 test who passed the new 
item. Similarly, the mean MA of those passing an item was compared with 
the mean MA of those failing it. Items not showing a satisfactory degree of 
correspondence with 1916 MA were discarded at this stage. For preschool 
children, for whom earlier test scores were unavailable, chronological age 
was substituted for MA as the criterion in item selection. Except for these 
younger ages, items were chosen for inclusion in the provisional scales on 
the basis of their agreement with the earlier Stanford-Binet. Such a pro- 
cedure insured a certain minimum of similarity in the general area measured 
by the earlier and later forms. 

The major step in the development of the 1937 Stanford-Binet included 
the administration of the provisional forms to a carefully chosen sample of 
3184 subjects, and the final selection and allocation of items. All subjects 
were given both forms of the test, one-half taking Form L first, the other half 
Form M first. The interval between the two tests ranged from one day to 
one week. In this step, the criteria for item analysis were chronological age 
and composite total score on both provisional forms. The latter is an internal
consistency criterion which serves to increase the homogeneity of the test.
Three specific measures were employed in the analysis of each item: (a)
curve of percentages of subjects passing the item in successive chronological
ages; (b) curve of percentages of subjects passing the item in successive
intervals of total score on the two forms; (c) correlation of each item with
total score on the two forms.
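
The three item-analysis measures just listed can be sketched in a few lines
of Python. This is only an illustration: the function names and the six toy
records are invented here, and a Pearson (point-biserial) coefficient stands
in for the item-total correlation, since the text does not say which
coefficient was actually computed.

```python
from statistics import mean, pstdev


def percent_passing_by_group(records, key):
    """Percentage of subjects passing the item within each value of `key`.

    Grouping by "age" gives measure (a); grouping by a total-score
    interval gives measure (b).
    """
    groups = {}
    for record in records:
        groups.setdefault(record[key], []).append(record["passed"])
    return {g: 100.0 * sum(v) / len(v) for g, v in sorted(groups.items())}


def item_total_correlation(records):
    """Measure (c): correlation of pass/fail on the item with total score."""
    x = [1.0 if r["passed"] else 0.0 for r in records]
    y = [float(r["total"]) for r in records]
    mx, my = mean(x), mean(y)
    covariance = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return covariance / (pstdev(x) * pstdev(y))


# Invented toy data: age in years, composite total score, pass/fail on one item.
records = [
    {"age": 2, "total": 10, "passed": False},
    {"age": 2, "total": 12, "passed": False},
    {"age": 3, "total": 20, "passed": True},
    {"age": 3, "total": 18, "passed": False},
    {"age": 4, "total": 30, "passed": True},
    {"age": 4, "total": 28, "passed": True},
]

print(percent_passing_by_group(records, "age"))
print(round(item_total_correlation(records), 2))
```

An item suited to an age scale would show a rising percentage passing across
successive ages and a substantial positive correlation with total score.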


An illustration of the first of these three procedures is to be found in
Figure 28, which shows the percentage of subjects at each age who passed
two of the items retained in the final forms.² It will be noted that for the
3-year test the curve rises more steeply than for the 10-year test. Related to
this difference is the fact that the percentage of 3-year-olds who pass the
3-year test (73 per cent) is greater than the percentage of 10-year-olds who
pass the 10-year test (59 per cent). These differences in percentages passing
an item are a necessary requirement of an age scale (14, p. 9 and Ch. 8). In
the discussion of age scores in Chapter 4, it was seen that the standard
deviation of mental ages must increase with age if that of the IQ is to remain
constant. Now, if variability of MA is greater in older groups, it means that
there must be a greater spread in the performance of older subjects over
adjacent year levels. Hence with increasing age, fewer and fewer subjects
will pass a test that corresponds to their own age level. In the 1937
Stanford-Binet, the percentage of at-age passes dropped from 77 at age 2 to
slightly below 50 at the average adult level. Still lower percentages passing
were used in selecting items for the superior adult levels, in order to provide
adequate ceiling for the test.

Fig. 28. Distribution of Percentages Passing Two Tests of the 1937 Stanford-Binet.
(From Terman and Merrill, 22, p. 15.)

2 In the standard citation of Stanford-Binet items, the letter indicates the
form (L or M), the Roman numeral designates the year level, and the Arabic
numeral specifies the test or item number within that year level. Thus L,III,3
is the third item in the 3-year level of Form L.
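
The link between a constant IQ spread and a growing MA spread follows from
the ratio IQ, IQ = 100 x MA / CA: at a fixed chronological age, the standard
deviation of MA equals the standard deviation of IQ times CA / 100. The
sketch below works this out for two ages. The function name is invented, and
the SD of 16 IQ points is an assumed, illustrative value rather than a figure
reported for the 1937 ratio IQ's.

```python
def sd_mental_age_months(ca_months, sd_iq=16.0):
    """SD of mental age (in months) implied by a constant ratio-IQ SD.

    Follows from IQ = 100 * MA / CA at a fixed chronological age.
    sd_iq=16 is an assumed value used only for illustration.
    """
    return sd_iq * ca_months / 100.0


for years in (3, 10):
    sd = sd_mental_age_months(12 * years)
    print(f"CA {years} years: SD of MA = {sd:.2f} months")
```

Under this assumption the spread of mental ages is under 6 months at age 3
but over 19 months at age 10, so the performance of 10-year-olds is scattered
over more year levels and fewer of them pass a test placed at their own age.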

In the final selection of items, consideration was given not only to age
differentiation and internal consistency, but also to the reduction and
balancing of sex differences in percentage passing. An effort was made to
exclude items
passed by a significantly greater percentage of either sex, on the assumption 
that such items might reflect purely fortuitous and irrelevant differences in the 
experiences of the two sexes (14, Ch. 5). Owing to the limited number of 
items available for each age level, however, it was not possible to eliminate
all sex-differentiating items. In order to rule out sex differences in total score, 
therefore, the remaining sex-differentiating items were balanced, approx- 
imately the same number favoring boys and girls. For proper interpretation 
of test scores, the test user should be aware of such item-selection procedures. 
A statement that boys and girls do not differ significantly in Stanford-Binet 
IQ, for example, provides no information whatever regarding sex differences 
in intelligence. Since sex differences were deliberately eliminated in the proc- 
ess of selecting items for the test, their absence from the final scores merely 
indicates that this aspect of test construction was successfully executed. 

An important feature of the development of the 1937 Stanford-Binet is 
the care with which the standardization sample was assembled. The 3184 
subjects employed for this purpose included approximately one hundred 
children at each half-year interval from 1½ to 5½ years, two hundred at
each age from 6 to 14, and one hundred at each age from 15 to 18. All
subjects were within one month of a birthday (or half-year birthday) at the time
of testing, and every age group contained an equal number of boys and girls. 
From age 6 up, most subjects were tested in school, although a few of the 
older subjects were obtained outside of school in order to round out the 
sampling. Preschool children were contacted in a variety of ways, many of
them being siblings of the school children included in the sample. 

In order to obtain an adequate geographical distribution, testing was con- 
ducted in 17 communities located in 11 widely separated states. Several in- 
dices of socioeconomic level were checked in the effort to include a repre- 
sentative cross section of socioeconomic groups. Despite such precautions, the 
sampling was somewhat higher in socioeconomic level than the general popu- 
lation. There was also an excess of urban as contrasted to rural subjects, al- 
though the testers went to great lengths to sample the relatively inaccessible 
rural population. Both of these sampling inadequacies would tend to make the 
intelligence test performance of the standardization group higher than that 
of the general population. To allow for this condition, adjustments were 
made in the construction of the test such that the mean IQ of the standardiza- 
tion sample was in fact above 100. It should also be noted that the popula- 
tion sampled was limited to native-born, white subjects. Despite its limitations, 
however, the normative sampling employed in standardizing the 1937 Stan- 
ford-Binet was without doubt more nearly representative of the general popu- 
lation than any that had previously been utilized in the construction of 
psychological tests. The adequate sampling of such a broadly defined popula- 
tion is a monumental task. 

In the preparation of the 1960 Stanford-Binet, items were selected from 
Forms L and M on the basis of the performance of 4498 subjects, aged
2½ to 18 years, who had taken either or both forms of the test between
1950 and 1954. The subjects were examined in six states situated in the 
Northeast, in the Midwest, and on the West Coast. Although these cases did 
not constitute a representative sampling of American school children, care 
was taken to avoid the operation of major selective factors. The group also in- 
cluded two stratified samples of California children, used in special statistical 
analyses. These samples consisted of 100 6-year-olds stratified with regard 
to father's occupation and 100 15-year-olds stratified with regard to both fa- 
ther's occupation and grade distribution. 

The 1960 Stanford-Binet did not involve a restandardization of the scale. 
The new samples were utilized only to check changes in item difficulty over 
the intervening period. Accordingly, the difficulty of each item was rede- 
termined by finding the percentage of children passing it at successive mental 
ages on the 1937 forms. Some items were checked for possible regional and 
socioeconomic differences through subgroup comparisons within the total 
sample. No new material was introduced, but in a few items obsolescent
drawings of common articles had to be altered. In these instances, the items
were pretested on special groups before including them in the scale. Apart
from these minor modifications in drawings, the only content changes in the
1960 Stanford-Binet include elimination of items, rescoring of a few items,
and relocation of items in different year levels.


DESCRIPTION OF THE 1960 STANFORD-BINET 


As in the 1937 forms, the tests in the 1960 Stanford-Binet are grouped
into age levels ranging from age II to superior adult. Between the ages of
II and V, the test proceeds by half-year intervals. Thus there is a level
corresponding to age II, one to age II-6, one to age III, and so forth. Since
progress is so rapid during these early ages, it proved feasible and desirable
to measure change over six-month intervals. Between V and XIV the age
levels correspond to yearly intervals. The remaining levels are designated
as Average Adult and Superior Adult levels I, II, and III. Each age level
contains six tests, with the exception of the Average Adult level, which
contains eight. The tests within any one age level are of approximately uniform
difficulty and are arranged without regard to such residual differences in
difficulty as may be present. An alternate test is also provided at each age
level. Being of approximately equivalent difficulty, such alternates may be
substituted for any of the tests in the level, if one of these tests should be
spoiled during its administration.

The materials needed for administering the Stanford-Binet include a box
of standard toy objects for use with younger children, a set of printed cards,
a record booklet for recording responses (or an abbreviated record form),
and a test manual. Some of the objects employed at the preschool ages can
be seen in Figure 29. The tasks to be performed by the subject in the various
Stanford-Binet tests run the gamut from simple manipulation to abstract
reasoning. The following description is intended only to illustrate the wide
variety of content covered by these scales and should not be construed as a
complete listing of item types. Nor should it be interpreted as a
classification of the psychological functions measured by the specific tests.
Some information on the latter question is provided by factor analysis data to
be discussed in a later section on validity. The present description will be
restricted to an objective account of the tasks performed by the subject, with
no attempt at interpretation.

Fig. 29. Test Materials Employed in Administering the Stanford-Binet. (Courtesy
Houghton Mifflin Company.)

A few tests at the earliest age levels involve the manipulation of objects
and a certain amount of eye-hand coordination. Among them may be
mentioned the simple formboard shown in Figure 29, in which the three pieces
must be inserted into the appropriate recesses, as well as tests involving 
block building and stringing beads. In this category may also be placed the 
drawing tests which require the child to copy a circle, a square, or a dia- 
mond. Certain tests of perceptual discrimination also occur at the lower 
age levels. Examples include comparing the length of sticks and matching 
geometric forms. 

A relatively large number of tests at the lower levels involve the observa- 
tion and identification of common objects. Thus at the II-year level, the child 
is asked to point to parts of the body in a large picture of a doll. Some of the 
small toy objects reproduced in Figure 29 are employed in identifying
objects by name or by use. Pictures of objects are utilized at later age levels
for the same purposes. Other tests require the subject to name objects, or
pictures of objects. Still others call for the completion of pictures, or the
identification of the missing parts. Several tests ask the subject to state the
similarities or differences between certain sets of objects named by the
examiner. The last-named type of test occurs at different levels of difficulty,
extending into the higher ages.


A somewhat related group of tests, also found over a wide age range, may
be described under the heading of practical judgment or common sense. In a
series of "comprehension" questions, extending from year level III-6 to VIII,
the child is asked what he should do in meeting certain everyday-life
situations. In other, similar tests at these and higher year levels the subject
is required to explain why certain practices are commonly followed or certain
objects are employed in daily living. A number of tests calling for the
interpretation of pictorially or verbally presented situations, or the
detection of absurdities in either pictures or brief stories, also seem to fall
into this category.

Memory tests are found throughout the scale and utilize a wide variety of
materials. The subject is required to recall or recognize objects, pictures,
geometric designs, bead patterns, digits, sentences, and the content of
passages.

Several tests of spatial orientation occur at widely scattered levels. These
include maze-tracing, paper-folding, paper-cutting, rearrangement of
geometric figures, and directional orientation. Numerical tests range from
rudimentary quantitative concepts and counting, through the simple arithmetic
problems encountered in elementary school, to more complex arithmetic
reasoning problems involving novel solutions and the inductive formulation
of rules.

The most numerous type of test, especially at the upper age levels, is that
employing verbal content. In this category are to be found such well-known 
tests as vocabulary, analogies, sentence completion, disarranged sentences, 
defining abstract terms, and interpreting proverbs. Some stress verbal flu- 
ency, as in naming unrelated words as rapidly as possible, giving rhymes, or 
building sentences containing three given words. It should also be noted in 
this connection that many of the tests that are not predominantly verbal in 
content nevertheless require the understanding of fairly complex verbal in- 
structions. That the scale as a whole is still heavily weighted with verbal 
ability is indicated by the correlations obtained between the 45-word vo- 
cabulary test included in Form L and mental ages on the entire scale. These 
correlations were found to be .71, .83, .86, and .83 for groups of subjects 
aged 8, 11, 14, and 18 years, respectively (14, pp. 139-140). These correla- 
tions are at least as high as those normally found between tests designed to 
measure the same functions, and they fall within the range of common re- 
liability coefficients. Such data suggest that the Stanford-Binet scale as a 
whole measures to a large extent the same functions as the vocabulary test. 

In concluding this description of the Stanford-Binet, mention should be 
made of the abbreviated scale. Four tests in each year level were selected 
on the basis of validity and representativeness to constitute a short scale for 
use when time does not permit the administration of the entire scale. These 
tests are marked with an asterisk on the record booklets. Comparisons be- 
tween full-scale and abbreviated-scale IQ's on a variety of groups (cf. 22, 
pp. 61-62; 23, p. 262) show a close correspondence between the two, the 
correlations being approximately as high as the reliability coefficient of the 
full scale. The mean IQ, however, tends to run slightly lower on the short 
scale, a discrepancy that is brought out even more vividly when individual 
cases are considered. Thus over 50 per cent of the subjects received lower 
IQ's on the short version, while only about 30 per cent scored higher.


ADMINISTRATION AND SCORING 


In common with most individual intelligence tests, the Stanford-Binet re- 
quires a highly trained examiner. Both administration and scoring are fairly 
complicated for many of the tests. Considerable familiarity and experience 
with the scale are therefore required for a smooth performance. Hesitation 
and fumbling may be ruinous to rapport. Slight inadvertent changes in
wording may alter the difficulty of items. A further complication is presented
by the fact that tests must be scored as they are administered, since the
subsequent conduct of the examination depends upon the child's performance on
previously administered levels.

3 Since these are part-whole correlations, they are spuriously raised by the
inclusion of the vocabulary test in the determination of MA. Such effect,
however, is slight, since the vocabulary test constitutes less than 5 per cent
of the total number of test items (14, p. 140).

In taking the Stanford-Binet, no one subject tries all items. Each individ- 
ual is tested only over a range of age levels suited to his own intellectual 
level. Testing usually requires no more than thirty to forty minutes for
younger children and not more than one hour and a half for older subjects.
The standard procedure is to begin testing at a level slightly below the
expected mental age of the subject. Thus the first tests given should be easy
enough to arouse confidence, but not so easy as to cause boredom and
annoyance. If the subject fails any test within the year level first
administered, the next lower level is given. Such a procedure continues until a
level is reached at which all tests are passed. This level is known as the
basal age. Testing is then continued upward to a level at which all tests are
failed, designated as the ceiling age. When this level is reached, the test is
discontinued.
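
The search for basal and ceiling ages described above can be sketched as a
short procedure. The function name and the `administer` callback are
hypothetical, introduced only for illustration; the example record is that of
the 6-year-old child in Table 14, with testing assumed to start at level V.
Cases in which no basal or ceiling level exists within the scale are simply
reported as None.

```python
# Year levels of the 1960 scale, lowest to highest.
LEVELS = ["II", "II-6", "III", "III-6", "IV", "IV-6", "V", "VI", "VII",
          "VIII", "IX", "X", "XI", "XII", "XIII", "XIV",
          "AA", "SA I", "SA II", "SA III"]
# Six tests per level, except eight at the Average Adult level.
TESTS_PER_LEVEL = {level: 6 for level in LEVELS}
TESTS_PER_LEVEL["AA"] = 8


def find_basal_and_ceiling(start_level, administer):
    """Return (basal, ceiling, results) for one examination.

    administer(level) is a hypothetical callback returning the number of
    tests the subject passes at that level.
    """
    start = LEVELS.index(start_level)
    results = {start_level: administer(start_level)}

    # Work downward until a level is found at which all tests are passed.
    low = start
    while results[LEVELS[low]] < TESTS_PER_LEVEL[LEVELS[low]] and low > 0:
        low -= 1
        results[LEVELS[low]] = administer(LEVELS[low])
    all_passed = results[LEVELS[low]] == TESTS_PER_LEVEL[LEVELS[low]]
    basal = LEVELS[low] if all_passed else None

    # Work upward until a level is found at which all tests are failed.
    high = start
    while results[LEVELS[high]] > 0 and high < len(LEVELS) - 1:
        high += 1
        results[LEVELS[high]] = administer(LEVELS[high])
    ceiling = LEVELS[high] if results[LEVELS[high]] == 0 else None

    return basal, ceiling, results


# Tests passed at each level by the 6-year-old child of Table 14.
child = {"IV": 6, "IV-6": 5, "V": 3, "VI": 3, "VII": 2, "VIII": 1, "IX": 0}
basal, ceiling, results = find_basal_and_ceiling("V", child.__getitem__)
print(basal, ceiling)  # IV IX
```

Only the seven levels from IV through IX are administered in this example,
which is why no subject needs to attempt every item in the scale.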

Scoring of individual items, or tests, follows an all-or-none system. For 
each test, the minimal performance that constitutes "passing" is specified in
the manual. For example, in identifying objects by use at year level II-6, the
child passes if he correctly identifies three out of six designated objects; in
repeating five digits from memory at year level VII, correct response on any
one of three series is counted as a pass; in answering comprehension questions
at year level VIII, any four correct answers out of six represent a passing
performance. Certain tests appear in identical form at different year levels,
but are scored with a different standard of passing. Such tests are
administered only once, the subject's performance determining the year level at
which they are credited. The vocabulary test, for example, may be scored
anywhere from level VI to Superior Adult III, depending upon the number
of words correctly defined.


The items passed and failed by any one individual will show a certain
amount of scatter among adjacent year levels. We do not find that individuals
pass all tests at or below their mental age level and fail all tests above such
a level. Instead, the successfully passed tests are spread over several year
levels, bounded by the subject's basal age at one extreme and his ceiling age
at the other. The subject's mental age on the Stanford-Binet is found by
crediting him with his basal age and adding to that age further months of
credit for every test passed beyond the basal level. In the half-year levels
between II and V, each of the six tests counts as one month; between VI
and XIV, each of the six tests corresponds to two months of credit. Each of
the adult levels (AA, SA I, SA II, and SA III) covers more than one year
of mental age, the months of credit for each test being adjusted accordingly. 
For example, the Average Adult level includes eight tests, each of which is 
credited with two months; the Superior Adult I level contains six tests, each 
receiving four months. The highest mental age theoretically attainable on 
the Stanford-Binet is 22 years and 10 months. Such a score is not, of 
course, a true mental age, but a numerical score indicating degree of su- 
periority above the Average Adult performance. It certainly does not corre- 
spond to the achievement of the average 22-year-old individual, since the 
latter would receive a mental age of 15-9. For any adult over 18 years of 
age, a mental age of 15-9 corresponds to an IQ of 100 on this scale.
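
The crediting rule lends itself to a short worked computation. In the sketch
below the names are invented; the credits of 1, 2, 2, and 4 months per test
come from the text, the 5 months for Superior Adult II from Table 14, and the
6 months for Superior Adult III is an inference from the stated maximum of
22-10, not a figure given in the text. Basal ages above XIV are not handled.

```python
# Age in months corresponding to each year level that can serve as a basal.
LEVEL_AGE_MONTHS = {
    "II": 24, "II-6": 30, "III": 36, "III-6": 42, "IV": 48, "IV-6": 54,
    "V": 60, "VI": 72, "VII": 84, "VIII": 96, "IX": 108, "X": 120,
    "XI": 132, "XII": 144, "XIII": 156, "XIV": 168,
}

# Months of credit for each test passed at a level.
MONTHS_PER_TEST = {level: 1 for level in
                   ("II", "II-6", "III", "III-6", "IV", "IV-6", "V")}
MONTHS_PER_TEST.update({level: 2 for level in
                        ("VI", "VII", "VIII", "IX", "X", "XI", "XII",
                         "XIII", "XIV", "AA")})
MONTHS_PER_TEST["SA I"] = 4
MONTHS_PER_TEST["SA II"] = 5   # from Table 14
MONTHS_PER_TEST["SA III"] = 6  # inferred from the 22-10 maximum


def mental_age_months(basal, passed_above_basal):
    """Basal age plus months of credit for tests passed above the basal."""
    credit = sum(MONTHS_PER_TEST[level] * n
                 for level, n in passed_above_basal.items())
    return LEVEL_AGE_MONTHS[basal] + credit


def as_years_months(months):
    """Format months in the text's years-months notation, e.g. 5-8."""
    return f"{months // 12}-{months % 12}"


child = mental_age_months("IV", {"IV-6": 5, "V": 3, "VI": 3,
                                 "VII": 2, "VIII": 1})
adult = mental_age_months("XIII", {"XIV": 5, "AA": 6, "SA I": 3, "SA II": 2})
top = mental_age_months("XIV", {"AA": 8, "SA I": 6, "SA II": 6, "SA III": 6})
print(as_years_months(child), as_years_months(adult), as_years_months(top))
# 5-8 16-8 22-10
```

The first two results reproduce the mental ages of the two records in
Table 14, and the third reproduces the highest attainable mental age.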

A major innovation introduced in the 1960 Stanford-Binet was the
substitution of deviation IQ's for the ratio IQ's used in the earlier forms.
Such deviation IQ's are standard scores with a mean of 100 and an SD of 16.
As explained in Chapter 4, the principal advantage of this type of IQ is that
it provides comparable scores at all age levels, thus eliminating once for all
the vagaries of ratio IQ's. Despite the care with which the 1937 scales were
developed in the effort to obtain constant IQ variability at all ages, the SD's
of ratio IQ's on these scales fluctuated from a low of 13 at age VI to a high
of 21 at age II-6. Thus an IQ of 113 at age VI corresponded to an IQ of 
121 at age II-6. Special correction tables were developed to adjust for the 
major IQ variations in the 1937 scales (14, pp. 172-174). All these diffi- 
culties were circumvented in the 1960 form through the use of deviation 
IQ's, which automatically have the same SD throughout the age range. To
facilitate procedure, Pinneau developed tables in which deviation IQ’s can 
be looked up by entering MA and CA in years and months. These Pinneau 
tables are reproduced in the Stanford-Binet manual. 
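
The comparability argument can be checked with a little arithmetic. The
sketch below is not the procedure behind the Pinneau tables, which are
entered with MA and CA; it merely re-expresses a ratio IQ as a standard score
with a mean of 100 and an SD of 16, assuming (for illustration only) that the
mean ratio IQ at each age is exactly 100. The function name is invented.

```python
def deviation_iq(ratio_iq, sd_at_age, mean_at_age=100.0, target_sd=16.0):
    """Re-express a ratio IQ as a standard score with mean 100 and SD 16.

    mean_at_age=100 is an assumption made for this illustration.
    """
    z = (ratio_iq - mean_at_age) / sd_at_age
    return 100.0 + target_sd * z


print(deviation_iq(113, sd_at_age=13))  # age VI:   116.0
print(deviation_iq(121, sd_at_age=21))  # age II-6: 116.0
```

Both ratio IQ's lie one standard deviation above the mean of their own age
group, so they map to the same deviation IQ.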

A further change introduced in the 1960 form stems from the recognition 
that improvement on the test continues to age 18, rather than ceasing at age 
16 as assumed in the 1937 revision. Several major longitudinal studies con- 
ducted by different investigators over the intervening quarter-century strongly 
suggested that the abilities measured by intelligence tests continue to im- 
prove longer than had been supposed. The most direct evidence that im- 
provement on the Stanford-Binet continues beyond age 16 was provided by 
Bradway, Thompson, and Cravens (4), who retested subjects from the 1937 
standardization sample after intervals of 10 and 25 years. When first tested, 
these subjects had been from 2 to 5½ years old. Mean IQ's remained
virtually unchanged from first to second testings, but showed a significant 
rise of 11.3 points between second and third testings. The latter increase re- 
sulted from the fact that the subjects had continued to improve beyond age 
16, while the computation of 1937 IQ's assumed termination of growth at
that age.

The specific procedure followed in finding MA and IQ on the 1960
Stanford-Binet is illustrated in Table 14. In the upper part of this table is
the record of a child whose chronological age is 6 years and 4 months
(CA = 6-4). It will be noted that the basal age is IV and the ceiling age IX.
Additional credits total to 20 months, or 1 year and 8 months. The MA is thus
5-8. By reference to the Pinneau tables, this child's IQ is found to be 88. In
the lower part of Table 14 will be found the record of a 35-year-old adult.
As for anyone whose age is 18 or over, CA is taken as 18 in looking up the
IQ. This subject's basal age is XIII, and he earns 44 additional months
credit, giving him an MA of 16-8 and an IQ of 106.


TABLE 14. Computation of Stanford-Binet Mental Ages and Intelligence Quotients

                Number of       Months Credit
Year Level      Tests Passed    per Test        Total Credit

6-YEAR-OLD CHILD
IV              6                               Basal age 4 yrs. 0 mos.
IV-6            5               1               5
V               3               1               3
VI              3               2               6
VII             2               2               4
VIII            1               2               2
IX              0                               Ceiling age
                                                4 yrs. + 20 mos.
MA = 5-8        CA = 6-4        IQ = 88

35-YEAR-OLD ADULT
XIII            6                               Basal age 13 yrs. 0 mos.
XIV             5               2               10
AA              6               2               12
SA I            3               4               12
SA II           2               5               10
SA III          0                               Ceiling age
                                                13 yrs. + 44 mos.
MA = 16-8       CA = 18-0       IQ = 106
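The arithmetic of Table 14 can be retraced in a short sketch. It is offered only as an illustration: the names `LevelCredit` and `mental_age` are hypothetical, and the sketch covers the MA computation alone. The IQ itself is read from the Pinneau tables, which are not reproduced here and are therefore not modeled.

```python
from dataclasses import dataclass


@dataclass
class LevelCredit:
    """One row of Table 14 above the basal level (hypothetical helper type)."""
    level: str            # year-level label, e.g. "IV-6" or "SA I"
    tests_passed: int     # number of tests passed at that level
    months_per_test: int  # months of credit allowed per test


def mental_age(basal_years, credits):
    """Return the mental age as (years, months).

    The basal age is converted to months, the credit earned at each
    higher level is added, and the total is split back into years
    and months.
    """
    extra_months = sum(c.tests_passed * c.months_per_test for c in credits)
    total_months = basal_years * 12 + extra_months
    return divmod(total_months, 12)


# Rows taken from Table 14.
child = [
    LevelCredit("IV-6", 5, 1),
    LevelCredit("V", 3, 1),
    LevelCredit("VI", 3, 2),
    LevelCredit("VII", 2, 2),
    LevelCredit("VIII", 1, 2),
]
adult = [
    LevelCredit("XIV", 5, 2),
    LevelCredit("AA", 6, 2),
    LevelCredit("SA I", 3, 4),
    LevelCredit("SA II", 2, 5),
]

child_ma = mental_age(4, child)    # (5, 8), i.e. MA = 5-8
adult_ma = mental_age(13, adult)   # (16, 8), i.e. MA = 16-8
print(child_ma, adult_ma)
```

The two results agree with the mental ages of 5-8 and 16-8 entered in the table.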


RELIABILITY 


The reliability of the 1937 Stanford-Binet was determined by correlating
IQ's on Forms L and M administered to the standardization group within an
interval of one week or less. Such reliability coefficients are thus measures of
equivalence and stability over a relatively short period. An exceptionally
thorough analysis of the reliability of this test was carried out with reference
to age and IQ level of subjects (cf. 14, Ch. 6). In general, the
Stanford-Binet tends to be more reliable for the older than for the younger
ages, and for the lower than for the higher IQ's. Thus at ages 2½ to 5½, the
reliability coefficients range from .83 (for IQ 140-149) to .91 (for IQ 60-69);
for ages 6 to 13, they range from .91 to .97, respectively, for the same IQ
levels; and for ages 14 to 18, the corresponding range of reliability
coefficients extends from .95 to .98.

The fact that high IQ's tend to have lower reliability than low IQ's
results from the intrinsic properties of an age scale. The relation between
extent of variability and mental age, cited earlier in discussions of the ratio
IQ and of the percentage of subjects passing items at different age levels, is
also at the root of these reliability differences. For a detailed explanation of
how this comes about, the reader is referred to McNemar (14, Ch. 6). The
relation between IQ level and reliability within a single age group is
illustrated in Figure 30, which shows the bivariate distribution of IQ's
obtained by 7-year-old children on Forms L and M. It will be observed that the
individual entries fall close to the diagonal at lower IQ levels and spread
farther apart at the higher levels. This indicates closer agreement between L
and M IQ's at lower levels and wider discrepancies between them at upper
levels. With such a "fan-shaped" bivariate distribution, or scatter diagram, a
single correlation coefficient is misleading. For this reason, separate
reliability coefficients have been reported for different portions of the IQ
range.

On the whole, the data indicate that the Stanford-Binet is a highly reliable
test, most of the reported reliability coefficients for the various age and IQ
levels being over .90. Such high reliability coefficients were obtained despite
the fact that they were computed separately within each age group. It will be
recalled in this connection that all subjects in the standardization sample
were tested within a month of a birthday or half-year birthday. This
narrowly restricted age range would tend to produce lower reliability
coefficients than found for most tests, which employ more heterogeneous
samples. Translated in terms of individual IQ's, a reliability coefficient of
.90 and an SD of 16 give an error of measurement of approximately 5 IQ points
(cf. Ch. 5). In other words, the chances are about 2:1 that a child's true
Stanford-Binet IQ differs by 5 points or less from the IQ obtained in a single
testing, and the chances are 99:1 that it varies by no more than 13 points.
Reflecting the same differences found in the reliability coefficients, these
errors of measurement will be somewhat higher for younger than for older
children, and somewhat higher for brighter than for duller individuals.


Fig. 30. Parallel-Form Reliability of the Stanford-Binet: Bivariate
Distribution of IQ's Obtained by Seven-Year-Old Children on Forms L and M.
(From Terman and Merrill, 21, p. 45; reproduced by permission of Houghton
Mifflin Company.)
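The figures of 5 and 13 IQ points quoted for the error of measurement can be retraced with the usual formula, SD times the square root of one minus the reliability coefficient. The sketch below is an illustration under stated assumptions: the function name is hypothetical, errors of measurement are taken to be normally distributed, and the multipliers 1.0 and 2.58 are the conventional normal-curve values for odds of roughly 2:1 and 99:1.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r), the standard formula for the error of measurement."""
    return sd * math.sqrt(1.0 - reliability)


# Values given in the text: SD of 16 and a reliability coefficient of .90.
sem = standard_error_of_measurement(16.0, 0.90)

# Assumption: normally distributed errors, so that about two thirds of
# obtained scores lie within 1 SEM of the true score and about 99 per cent
# within 2.58 SEM.
band_2_to_1 = 1.0 * sem
band_99_to_1 = 2.58 * sem

print(round(sem, 1))          # approximately 5 IQ points
print(round(band_99_to_1))    # approximately 13 IQ points
```

A lower reliability coefficient widens both bands. This is why the errors are somewhat larger for the younger and the brighter groups.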


VALIDITY 


Some information bearing on the content validity of the Stanford-Binet is
provided by the description of item types given earlier in this chapter. The
tasks were obviously chosen so as to call into play such functions as accuracy
of observation, practical judgment, memory for many kinds of material,
ability to follow directions, spatial visualization, reasoning, and the handling
of abstract concepts. Verbal abilities clearly predominate, especially at the
higher mental ages. Skills acquired in school, such as reading and arithmetic,
are required for successful performance at the upper year levels. In so far as
all of these functions are relevant to what is commonly regarded as
"intelligence," the scale may be said to have content validity. The
preponderance of verbal content at the upper levels is defended by the test
authors on theoretical grounds. Thus they write:


At these levels the major intellectual differences between subjects reduce largely to 
differences in the ability to do conceptual thinking, and facility in dealing with con- 
cepts is most readily sampled by the use of verbal tests. Language, essentially, is the 
shorthand of the higher thought processes, and the level at which this shorthand 
functions is one of the most important determinants of the level of the processes 
themselves (21, p. 5). 


Continuity in the functions measured in the 1916, 1937, and 1960 scales
was insured by retaining in each revision only those items that correlated
satisfactorily with mental age on the preceding form. Age differentiation
represents the major criterion in the selection of Stanford-Binet items. Hence
there is assurance that the Stanford-Binet measures abilities that increase with
age during childhood and adolescence in our culture. In each form, internal
consistency was a further criterion for item selection. That there is a good
deal of functional homogeneity in the Stanford-Binet, despite the apparent
variety of content, is indicated by a mean item-scale correlation of .66 for
the 1960 revision. The predominance of verbal functions in the scale is shown
by the higher correlation of verbal than non-verbal items with performance
on the total scale (22, p. 34).

Further data pertaining to validity are provided by a series of factor
analyses of Stanford-Binet items. If IQ's are to be comparable at different
ages, the scale should have approximately the same factorial composition at all
age levels. For an unambiguous interpretation of IQ's, moreover, the scale
should be highly saturated with a single common factor. The latter point has
already been discussed in connection with homogeneity (cf. Ch. 5). If the
scores were heavily weighted with two group factors, such as verbal and
numerical aptitudes, an IQ of, let us say, 115 obtained by different persons
might indicate high verbal ability in one case and high numerical ability in
the other.
McNemar (14, Ch. 9) conducted separate factorial analyses of
Stanford-Binet items at 14 age levels, including half-year groups from 2 to 5
and year groups at ages 6, 7, 9, 11, 13, 15, and 18. The number of subjects
employed in each analysis varied from 99 to 200, and the number of items ranged
from 19 to 35. In each of these analyses, tetrachoric correlations were computed
between the items, and the resulting correlations were factor analyzed. By
including items from adjacent year levels in more than one analysis, some
evidence was obtained regarding the identity of the common factor at
different ages. The examination of tests that recur at several age levels
provided further data on this point. In general, the results of these analyses
indicated that performance on Stanford-Binet items is largely explicable in
terms of a single common factor. Evidence of additional group factors was found
at a few age levels, but the contribution of these factors was small. It was
likewise demonstrated that the common factor found at adjacent age levels was
essentially the same, although such a conclusion may not apply to more
widely separated age levels. In fact, there was some evidence to suggest that
the common factor becomes increasingly verbal as the higher ages are
approached. The common factor loading of the vocabulary test, for example,
rose from .59 at age 6 to .91 at age 18.

In a more intensive search for the contribution of group factors, Jones (11,
12) factor analyzed Stanford-Binet items separately in four groups of
children aged 7, 9, 11, and 13 years. Each group consisted of 100 boys and 100
girls. At each age level, the results revealed a number of distinct but
correlated abilities. Among them were several verbal, memory, reasoning,
spatial visualization, and perceptual factors. Moreover, both the factors
identified and their relative weights varied somewhat from one age level to
another. Further evidence that the Stanford-Binet IQ may depend upon different
functions at different ages is provided by Hofstaetter's factor analysis of the
performance of a single group of subjects retested over 18 years (9). This
investigation suggested that "persistence" is an important determiner of IQ
below age 4, but that "manipulation of symbols" is the principal ability
measured after that age.

Data on both concurrent and predictive validity of the Stanford-Binet have
been reported chiefly in terms of academic achievement as a criterion. Since
the publication of the original 1916 Scale, many correlations have been
computed between Stanford-Binet IQ and school grades, teachers' ratings, and
achievement test scores. Most of these correlations fall between .40 and .75.
School progress was likewise found to be related to Stanford-Binet IQ,
children who were accelerated by one or more grades averaging considerably
higher in IQ than those at normal age-grade location, and children who were
retarded by one or more grades averaging considerably below (14, Ch. 3).

Like most intelligence tests, the Stanford-Binet correlates highly with
performance in nearly all academic courses, but its correlations are highest
with the predominantly verbal courses, such as English and history. The
following correlations, found between Form L IQ and achievement test scores of
high school sophomores, are typical (3). The number of cases used in computing
these correlations varied from 78 to 200.

Reading comprehension      .73        Spelling      .46
Reading speed              .43        History       .59
English usage              .59        Geometry      .48
Literature acquaintance    .60        Biology       .54

In another study, a correlation of .64 was obtained between Form L IQ and
first-year college grades in a group of 67 college freshmen (16). Other studies
(2, 18) of college freshmen have yielded correlations in the .50's between
Stanford-Binet IQ's and college grades. With college groups, both selective
factors and insufficient test ceiling undoubtedly tend to lower the correlations.

The long-range stability of Stanford-Binet IQ's may also be regarded as 
evidence of predictive validity. Longitudinal studies of the same subjects 
over periods of 10 to 25 years contribute to our understanding of the be- 
havior sampled by such tests as the Stanford-Binet. Such information belongs 
more properly under the heading of validity than under the heading of reli- 
ability, since it concerns broad, enduring behavioral changes rather than tem- 
porary fluctuations in specific test performance (cf. Ch. 5). 

Several longitudinal studies have provided data on long-range stability of 
"intelligence" as measured by the Stanford-Binet and other common tests (cf. 
1, pp. 231-238; 22, pp. 16-17). As might be expected, retest correlations 
are higher, the shorter the interval between tests. In one investigation with 
the Stanford-Binet, for example, the correlation between tests given to the 
same children at 3 and 4 years of age was .83 (19). Correlations between 
the 3-year tests and retests at successive ages decreased constantly until at 
age 12 the correlation had dropped to .46. With constant intervals between 
tests, retest correlations are generally higher the older the children. This is 
understandable, since the older the individual, the more of his intellectual de- 
velopment has already occurred. Hence subsequent changes make relatively 
little difference in his intellectual status. Of special relevance to the
Stanford-Binet is the follow-up conducted by Bradway, Thompson, and Cravens (4)
on children originally tested between the ages of 2 and 5½ as part of the
1937 Stanford-Binet standardization sample. Initial IQ's correlated .65 with
10-year retests and .59 with 25-year retests. The correlation between the
10-year retest (mean age = 14 years) and 25-year retest (mean age = 29
years) was .85.

In terms of group trends, the long-range predictive validity of a
Stanford-Binet IQ, especially when obtained with school-age children, is
remarkably high. In individual cases, however, large upward or downward shifts
in IQ may occur over an interval of a few years. In one extensive longitudinal
project (10), individual IQ changes of as much as 50 points were observed.
Between the ages of 6 and 18, when retest correlations are generally high,
59 per cent of the cases changed 15 or more IQ points, 37 per cent changed
20 or more points, and 9 per cent changed 30 or more points. Nor were
these changes random or erratic in nature. Some children exhibited consistent
upward or downward trends over several years. Large shifts in IQ, moreover,
were usually associated with the cultural milieu and emotional climate
in which the child was reared. Children in underprivileged environments
tended to lose and those in superior environments to gain with age, in
relation to the norms.

Severe illness, changes in home or familial conditions, therapeutic and
remedial programs, and other major environmental variables operating during
childhood are likely to be reflected in sharp rises or drops in individual
IQ's. The literature on the effects of environmental conditions upon
intellectual development is extensive, and it is beyond the scope of this book
to survey it (cf. 1). Its findings, however, contribute to our understanding of
the construct "intelligence" which such tests as the Stanford-Binet try to
measure. It is certainly evident that any one IQ needs to be interpreted in the
light of all available background information about the individual.


EVALUATION 


One of the chief advantages of the Stanford-Binet derives from the mass
of interpretive data and extensive clinical experience that have been
accumulated regarding this test. For many clinicians, educators, and others
concerned with the evaluation of general ability level, the Stanford-Binet IQ
has become almost synonymous with intelligence. Much has been learned about
what sort of behavior can be expected from a child with an IQ of 50 or 80
or 120 on this test. The distributions of IQ's in the standardization samples
for the 1916 Stanford-Binet, and later for the 1937 revision, have provided
a common frame of reference for the interpretation of IQ's. In Table 15 will
be found the distribution of the composite L and M IQ's of the 1937
standardization sample, showing percentage of cases falling within each
10-point interval of IQ, as well as the descriptive terms commonly applied to
these levels. The same data are presented graphically in Figure 31. It will be
noted that a close approximation to a normal curve was attained in developing
these scales.

The widespread use of such a classification of IQ levels, although of
unquestionable help in standardizing the interpretation of test performance,
carries certain dangers. Like all classifications of persons, it should not be
rigidly applied, nor used to the exclusion of other data about the individual.
There are, of course, no sharp dividing lines between the "mentally defective"
and the "borderline," or between the "superior" and the "very superior."
Individuals with IQ's of 60 have been known to make satisfactory adjustments
to the demands of daily living, while some with IQ's close to 100 require
institutional care. Persons with IQ's of 160 do occasionally lead
undistinguished lives, while some with IQ's much closer to 100 make outstanding
contributions. Decisions regarding institutionalization, parole, or discharge
of mental defectives must take into account not only IQ but also social
maturity, emotional adjustment, physical condition, and other circumstances of
the individual case. Nor is high IQ synonymous with genius. High-level
achievement may require in addition creativity, originality, special talents,
persistence, singleness of purpose, and other propitious emotional and
motivational factors.


TABLE 15. Distribution of Standardization Sample in Composite Stanford-Binet IQ
on Forms L and M

(From Merrill, 15, p. 650)

               Percentage
IQ             of Cases       Classification

160-169         0.03          Very superior
150-159         0.2           Very superior
140-149         1.1           Very superior
130-139         3.1           Superior
120-129         8.2           Superior
110-119        18.1           High average
100-109        23.5           Normal or average
90-99          23.0           Normal or average
80-89          14.5           Low average
70-79           5.6           Borderline defective
60-69           2.0           Mentally defective
50-59           0.4           Mentally defective
40-49           0.2           Mentally defective
30-39           0.03          Mentally defective
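The statement that the distribution in Table 15 closely approximates a normal curve can be checked roughly by computing the percentage of cases a normal curve would place in each 10-point interval. The sketch below rests on assumptions not given in the text: a mean of 100 and an SD of 16 for the composite IQ's, and interval limits extended by half a point on each side to allow for whole-number IQ's. The function names are hypothetical.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative normal probability, computed from the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


# Observed percentages from Table 15, keyed by the lower limit of each interval.
observed = {
    160: 0.03, 150: 0.2, 140: 1.1, 130: 3.1, 120: 8.2, 110: 18.1, 100: 23.5,
    90: 23.0, 80: 14.5, 70: 5.6, 60: 2.0, 50: 0.4, 40: 0.2, 30: 0.03,
}


def expected_percentages(mean=100.0, sd=16.0):
    """Percentage of a normal distribution falling in each interval.

    Assumption: mean 100 and SD 16. An interval such as 100-109 is
    treated as running from 99.5 to 109.5.
    """
    result = {}
    for low in observed:
        lower, upper = low - 0.5, low + 9.5
        share = normal_cdf(upper, mean, sd) - normal_cdf(lower, mean, sd)
        result[low] = 100.0 * share
    return result


expected = expected_percentages()
for low in sorted(observed, reverse=True):
    print(f"{low}-{low + 9}: observed {observed[low]:5.2f}   normal {expected[low]:5.2f}")
```

Comparing the two columns interval by interval shows where the sample departs from the curve. A different assumed mean or SD would shift the expected column accordingly.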

In interpreting the IQ, it should be borne in mind that the Stanford-Binet
is primarily a measure of scholastic aptitude and heavily loaded with verbal
functions, especially at the upper levels. Individuals with a language
handicap, as well as those whose strongest abilities lie along non-verbal
lines, will thus score relatively low on such a test. Similarly, there are
undoubtedly a number of fields in which scholastic aptitude and verbal
comprehension are not of primary importance. Obviously, to apply any test to
situations for which it is inappropriate will only reduce its effectiveness.
Because of the common identification of Stanford-Binet IQ with the very concept
of intelligence, there has been a tendency to expect too much from this one
test.


[Figure: vertical axis, Percentage of Cases.]

Fig. 31. Distribution of Standardization Sample in Composite IQ on Forms L
and M of the Stanford-Binet. (From Terman and Merrill, 21, p. 37; reproduced by
permission of Houghton Mifflin Company.)


The Stanford-Binet is likewise unsuited to the measurement of differential
aptitudes. Although for many years clinicians tried to analyze the individual's
performance on different types of items, as a qualitative supplement to the
IQ, such a practice is unwarranted and has now been largely abandoned. In
the first place, the same types of items do not recur at all age levels. And
factorial analyses have corroborated the fact that a somewhat different
combination of abilities is measured at different ages. In the second place, the
number of items of each type is too small to permit a reliable determination
of the individual's performance on separate item groups. The difference
between an individual's achievement on, let us say, the spatial orientation and
memory items might thus be due entirely to chance. Finally, it is difficult
to determine the psychological functions measured by an item through a
mere inspection of its content. The previously cited factor analyses of
Stanford-Binet items showed that a single general factor accounts for a large
part of the variance. Moreover, the scales were deliberately constructed so as
to maximize the contribution of such a general factor and to minimize the
influence of group factors or separate abilities.

The Stanford-Binet is not suitable for adult testing, especially within the
normal and superior range. Despite the three Superior Adult levels, there is
insufficient ceiling for most superior adults. In such cases, it is often
impossible to reach a ceiling age level at which all tests are failed. Moreover,
most of the Stanford-Binet tests have more appeal for children than for adults,
the content being of relatively little interest to most adults. It will also be
recalled that the MA concept is essentially inapplicable to normal and
superior adults, mental ages beyond 14 not being interpretable in the same
simple and straightforward manner as those below 14.

Many clinicians regard the Stanford-Binet not only as a standardized test,
but also as a clinical interview. The very characteristics that make this scale
so difficult to administer also create opportunities for interaction between 
examiner and subject, and provide other sources of clues for the experienced 
clinician. Even more than most other individual tests, the Stanford-Binet 
makes it possible to observe the subject’s work methods, his approach to a 
problem, and other qualitative aspects of performance. The examiner may 
also have an opportunity to judge certain personality characteristics, such as 
willingness, self-confidence, social confidence, and attention. Any qualitative 
observations made in the course of Stanford-Binet administration should, of 
course, be clearly recognized as such and ought not to be interpreted in the 
same manner as objective test scores. The value of such qualitative observa- 
tions depends to a large extent upon the skill, experience, and psychological 
sophistication of the examiner, as well as upon his awareness of the pitfalls 
and limitations inherent in this type of observation. 

For the purposes for which it was designed, the Stanford-Binet is
undoubtedly a very successful instrument. Its viability over the years attests
to the widespread recognition of its merits. No evaluation of the Stanford-Binet
would be complete without a reference to the technical quality of the
procedures followed in its development. The construction of an age scale
requires tremendous expenditure of time and effort. As McNemar wrote in
connection with the preparation of the 1937 forms, "Many have pointed out
the difficulties in constructing an age scale, but only those who have been
through the mill are in a position to appreciate fully the obstacles" (14, pp.
83-84). The procedures followed in determining the reliability of the
Stanford-Binet are unusually thorough, and the reliability of the scale is
high. A wealth of validation data has accumulated over the years to give
meaning to the Stanford-Binet IQ. The elimination of obsolescent item content
and the substitution of the deviation IQ for the ratio IQ in the 1960 revision
provided the needed rejuvenation for continued use.


Group Tests


While individual scales such as the Stanford-Binet find their principal
application in the clinic, group tests are used primarily in the educational
system, in industry, and in the armed services. It will be recalled that mass
testing began during World War I with the development of the Army Alpha
and the Army Beta for use in the United States Army (cf. Ch. 1). The
former was a verbal test designed for general screening and placement
purposes. The latter was a non-language test for use with individuals who could
not properly be tested with the Alpha owing to foreign-language background
or illiteracy. The pattern established by these tests was closely followed in
the subsequent development of a large number of group tests for civilian
application.

In this chapter, some of the outstanding examples of group tests in current
use will be considered. For convenience, these tests have been classified
with reference to the age level for which they are designed. Such a
classification is only approximate, however, since a number of test series
include overlapping forms suitable for widely varying age levels. The same
integrated test series may thus yield comparable scores from the primary grades
to the high school senior or even the college level. The tests to be discussed
in this chapter have been selected to represent the content and scope of
available [...] age tests, which will be covered in the next [...] two tests
will be examined at [...] by name only; for an evaluation [...] made to the
test manuals as well as to [...] and other sources cited in Chapter 2.


TESTS FOR THE PRIMARY LEVEL 


The youngest age at which it has proved feasible to employ group tests is
the kindergarten and first-grade level. At the preschool ages, individual
testing is required in order to establish and maintain rapport, as well as to
administer the oral and performance type of items suitable for such children.
By the age of 5 or 6, however, it is possible to administer printed tests to small
groups of no more than 10 or 15 children. In such testing, the examiner must
still give considerable individual attention to the subjects to make sure that
directions are followed, see that pages are turned properly in the test
booklets, and supervise other procedural details. With one or two assistant
examiners, somewhat larger groups may be tested if necessary.

Group tests for the primary level generally cover kindergarten and the first 
two or three grades of elementary school. In such tests, each child is pro- 
vided with a booklet on which are printed the pictures and diagrams con- 
stituting the test items. All instructions are given orally and are usually ac- 
companied by demonstrations. Fore-exercises are frequently included in 
which subjects try one or two sample items and the examiner or proctor 
checks the responses to make certain that the instructions were properly
understood. The child marks his responses on the test booklet with a crayon or
soft pencil. Most of the tests require only marking the correct picture out of a
set. A few call for simple motor coordination, as in drawing lines that join
two dots.

Obviously tests for the primary level can require no reading or writing on
the part of the subject. For this reason, they are sometimes described as
"non-verbal tests." This category should not be confused with the
non-language tests to be considered in the next chapter. The latter type
requires no language at all, either written or spoken, and is suitable for
foreign-speaking and deaf as well as for illiterate subjects. Non-language
tests for the primary level have also been developed for testing special groups
of children, but the usual primary group test involves extensive use of spoken
language. The designation "non-verbal" for these tests, although commonly
employed, may be somewhat misleading, since it can be properly applied only to
the test content and not to the subject's behavior. For example, tests of
verbal comprehension can be administered at these age levels through the use of
pictorial content. Thus the child's vocabulary or his sentence comprehension
can be tested by means of pictures. For this reason, it would seem more accurate
to refer to these tests by such a term as "pictorial" or "preliterate," rather
than "non-verbal."

One of the best-known group tests for the primary level is the
Pintner-Cunningham Primary Test (23), which has been in use in different forms
since 1923. This test is part of the Pintner General Ability Tests, Verbal
Series (22), which extends through the college freshman level. The
Pintner-Cunningham is designed for children in kindergarten, grade 1, and the
first half of grade 2. The current revision is available in three equivalent
forms, A, B, and C. A few illustrative items taken from each of the three forms
are reproduced in Figure 32. Each form consists of the following seven
subtests:


1. Common Observation: Subject marks all objects in a set that fit a given
category, such as all things we need when we go out in the rain (Fig. 32,
row 1).


2. Aesthetic Differences: Subject marks the prettiest of three drawings of the
same object, such as the house shown in Figure 32, row 2.


3. Associated Objects: In each row of pictures, subject marks the two things that
belong together, such as the chicken and the egg in Figure 32, row 3.


4. Discrimination of Size: Subject marks the items of clothing that are the right
size for the individual pictured. For each article of clothing, such as shoes,
hat, gloves, etc., one is too large, one too small, and one is the correct choice.


5. Picture Parts: This test includes a series of pictures of increasing complexity
containing children, animals, toys, and other objects. The same objects are
reproduced outside the picture, mixed in with other objects. Subject marks all
the objects outside the picture that appear in the picture.


6. Picture Completion: For each incomplete picture, the subject locates and
marks the correct missing part among several parts presented.


7. Dot Drawing: Subject copies drawings made by joining dots, as shown in the
fourth row of Figure 32.


[Figure 32 directions: Test 1. Mark the things that we need when we go out in
the rain. Test 3. Mark the two things that belong together. Test 7. Look at
each picture; see how it is drawn; make another one like it in the dots.]

Fig. 32. Illustrative Items from Forms A, B, and C of the Pintner-Cunningham
Primary Test. (Copyright by World Book Company.)


A total raw score is obtained by adding the point scores on each subtest.
By reference to a table of age norms, the mental age corresponding to any
given raw score can be found. This can be divided by the CA to find the IQ
by the traditional ratio method. Such an IQ, however, is of doubtful
significance, since there is no assurance that it will have uniform variability
at different age levels. A procedure for converting raw scores into deviation
IQ's is also provided in the test manual. Since the SD of these deviation IQ's
has been set at 16, they are expressed in units that are comparable to those of
the Stanford-Binet IQ. It cannot be assumed, of course, that the scores have the
same meaning as Stanford-Binet IQ's, since the tests differ in content, mode
of administration, and other characteristics. In all cases, IQ's should be
accompanied by the name of the test from which they were obtained. An
advantage of Pintner-Cunningham IQ's is that they are comparable with the
IQ's obtained on other parts of the Pintner General Ability Tests for higher
age levels, since the standard score scales developed for different levels are
continuous.
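The contrast between the ratio IQ and the deviation IQ can be made concrete with a brief sketch. It is an illustration only and does not reproduce the conversion procedure of the test manual, which is not described here. The function names, the assumed mean of 100 for the deviation IQ, and all raw-score means and SD's in the example are hypothetical.

```python
def ratio_iq(ma_months, ca_months):
    """Traditional ratio IQ: 100 times mental age over chronological age."""
    return 100.0 * ma_months / ca_months


def deviation_iq(raw_score, group_mean, group_sd, iq_sd=16.0):
    """Deviation IQ as a standard score within the subject's own age group.

    Assumptions: a mean of 100, and the SD of 16 mentioned in the text.
    group_mean and group_sd are the raw-score mean and SD of the age group.
    """
    z = (raw_score - group_mean) / group_sd
    return 100.0 + iq_sd * z


# Hypothetical child: MA of 6-6 (78 months) at a CA of 6-0 (72 months).
print(round(ratio_iq(78, 72), 1))

# Hypothetical norms for two age groups with different raw-score spread.
# A score one SD above the group mean yields the same deviation IQ in both.
print(deviation_iq(50, group_mean=40, group_sd=10))
print(deviation_iq(66, group_mean=60, group_sd=6))
```

Because the deviation IQ is tied to the spread of scores within each age group, a given IQ represents the same relative standing at every age. The ratio IQ carries no such assurance.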

Reliability of the Pintner-Cunningham, found by correlating Forms A and
B, varied from .83 to .89 within different groups of kindergarten or primary
grade children.¹ As in the validation of many group intelligence tests, the
Stanford-Binet and certain measures of school achievement served as the
criteria. Correlations of .73, .80, and .88 between Pintner-Cunningham and
Stanford-Binet were found in three groups of kindergarten and first grade
children. In a group of 260 first-grade pupils, the Pintner-Cunningham
correlated .63 with reading test scores.

¹ Unless otherwise indicated, all data reported about a test are taken from
the test manual. Complete references for each test cited can be found at the
end of each chapter. The dates included in these references are the terminal
publication dates of the latest edition. Frequently, parallel forms of the
same edition or supplements to the manual appear in different years. In such
cases, more than one year is given in the publication date.

Other tests containing levels suitable for kindergarten and the primary
grades are the Otis Quick-Scoring Mental Ability Tests (21),
Kuhlmann-Anderson Intelligence Tests (15), California Test of Mental
Maturity (25), and The Lorge-Thorndike Intelligence Tests (18). Like the
Pintner series, all of these tests provide scores expressed in comparable
terms from the primary level through high school or college. All four have
been either published or revised since 1952.


TESTS FOR THE ELEMENTARY SCHOOL LEVEL 


In general, the tests to be considered in this section are designed for use
from grade 4 through grade 8 or 9. Since literacy is presupposed at these
levels, such tests are predominantly verbal in content; most also include
arithmetic problems or other numerical tests. One of the first group tests
constructed for use with school children was the National Intelligence Test
(40). This test was prepared shortly after the termination of World War I by
a group of psychologists working under the auspices of the National Research
Council. It was primarily an adaptation for school children of the group
intelligence tests developed for Army recruits during the war. Today this
test is of historical interest only, having been replaced by more recently
developed instruments. For many years, however, it was one of the most widely
used intelligence tests in the elementary school grades and was also employed
in many psychological investigations on a wide variety of problems.

The five series of tests cited in the preceding section (Pintner, Otis,
Kuhlmann-Anderson, California, and Lorge-Thorndike) contain levels suitable
for the elementary school grades. To these may be added the Henmon-Nelson
Tests of Mental Ability, revised in 1957 and comprising three batteries
designed for grades 3 to 6, 6 to 9, and 9 to 12, respectively (16). Still
another example is provided by the Cooperative School and College Ability
Tests (SCAT), first developed in 1955 (9). Covering a range from grades 4 to
14, these tests will be more fully discussed in a later section, in
connection with their use in testing prospective college students.

As an illustration of tests for elementary school children, we shall examine
Level 3 (for grades 4 to 6) of the Lorge-Thorndike tests. The entire series
comprises five levels, for kindergarten and first grade, grades 2 to 3, 4 to
6, 7 to 9, and 10 to 12. The two lowest levels are entirely non-verbal. All
other levels contain both verbal and non-verbal parts, yielding separate
scores. According to the authors, however, all parts of the test were
designed to measure abstract intelligence, defined as “the ability to work
with ideas and the relationships among ideas.” While granting that verbal
symbols are the appropriate medium for testing abstract intelligence, the
authors have included a parallel set of non-verbal tests to provide a more
adequate basis for appraising the abilities of children with inferior
educational backgrounds or with special reading disabilities.

Like all other levels, Level 3 is available in two equivalent forms, A and
B. The verbal subtests include Sentence Completion, Verbal Classification,
Arithmetical Reasoning, and Vocabulary. The non-verbal subtests use only
pictorial, diagrammatic, or numerical content; they comprise Figure
Classification, Number Series, and Figure Analogies. Typical items
illustrating each of the seven tests are reproduced in Figures 33A and 33B.
The authors recommend that both verbal and non-verbal parts be routinely
administered to each child for a more comprehensive picture of his abilities.
Total time required is about 45 minutes for the verbal and 40 for the
non-verbal parts. Although all subtests have time limits, they are said to be
largely power tests.

Within each subtest, items were selected so as to yield an appropriate 


1. Sentence Completion: choose the word that will make the best, the truest, 
and the most sensible sentence. 
There's no book so ______ but something good may be found in it.


A. good B. true C. beautiful D. bad E. excellent 


2. Verbal Classification: think in what way the words in dark type go together. 
Then find the word on the line below that belongs with them. 


cotton wool silk 


A. dress B. sew C. fibre D. linen E. cloth 


3. Arithmetic Reasoning 


A man has to take a 300-mile trip by car. If he goes 40 miles each hour, 
how many miles does he still have to travel after driving 5 1/2 hours?


A. 180 mi. B. 100 mi. C. 60 mi. D.2 mi. E. none of these 


4. Vocabulary: choose the word which has the same meaning, or most nearly 
the same meaning, as the word in dark type at the beginning of the line. 


javelin A. bleach B. coffee C. jacket D. rifle E. spear 


Fig. 33A. Typical Items from The Lorge-Thorndike Intelligence Tests, Level 3: 
Verbal Battery. (Reproduced by permission of Irving Lorge and Robert L. Thorndike.)
range of difficulty, as well as high internal consistency. Median
item-subtest correlations in Level 3 range from .44 to .70 for the seven
subtests. Norms for the complete battery (including all levels) were
established by testing about 136,000 children in 44 communities distributed
over 22 states. To increase the representativeness of this standardization
sample, the communities were selected on the basis of a composite of
socioeconomic and educational variables previously found to be related to the
intelligence test performance of children within a community. Scores are
expressed as deviation IQ's, with a mean of 100 and an SD of 16. Age, grade,
and percentile norms are also provided. It is interesting to note that on
this test the differences between lower and higher socioeconomic levels are
about the same in verbal and non-verbal IQ's.


1. Figure Classification: the first three drawings in a row are alike in a certain
way. Find the drawing at the right that goes with the first three.

[Drawings with answer choices A to E not reproduced.]

2. Number Series: the numbers at the left are in a certain order. Find
the number at the right that should come next.

2 4 3 5 4 __        A. 2 B. 3 C. 4 D. 5 E. 6

3. Figure Analogies: the first two drawings go together in a certain way. Find
the drawing at the right that goes with the third drawing in the same way
that the second goes with the first. (Two items have been reproduced below,
one using pictures, the other geometric forms.)

[Drawings not reproduced.]

Fig. 33B. Typical Items from The Lorge-Thorndike Intelligence Tests, Level 3:
Nonverbal Battery. (Reproduced by permission of Irving Lorge and Robert L. Thorndike.)

Parallel-form reliability coefficients, found by administering Forms A and
B about a week apart, were .896 for the verbal and .814 for the non-verbal
parts of Level 3. These coefficients were found on 724 fifth-grade children.
Odd-even reliabilities were .940 for both verbal and non-verbal parts.
Although varying somewhat in different portions of the score range, the
standard error of measurement is about 4 IQ points for the verbal and about 6
for the non-verbal parts. Apart from the a priori choice of test content
designed to measure the ability to handle abstract concepts, symbols, and
relations, available evidence of validity centers around correlations with
other intelligence tests and with tests of educational achievement. For
example, in a group of 171 sixth-grade pupils, the correlations of total IQ
with Stanford Achievement Tests in Reading and Arithmetic were .87 and .76,
respectively. Little information on predictive validity is available, none
being reported for Level 3. Empirical data on validity as a whole are meager,
but are gradually being accumulated as the test continues in use.
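
The link between a reliability coefficient and the standard error of measurement can be sketched as follows. The sketch assumes the conventional formula, SEM = SD * sqrt(1 - r), and an IQ scale with an SD of 16. Which coefficient the manual used for its figures is not stated here, so the printed values are illustrative and need not match the manual's rounded figures. The function name is hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Conventional SEM: sd * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)

IQ_SD = 16.0  # SD of the deviation IQ scale described in the text

# Reliability coefficients reported in the text for Level 3.
for label, r in [("odd-even, either part", 0.940),
                 ("parallel-form, verbal", 0.896),
                 ("parallel-form, non-verbal", 0.814)]:
    sem = standard_error_of_measurement(IQ_SD, r)
    print(f"{label}: r = {r:.3f}, SEM = {sem:.1f} IQ points")
```

The lower the reliability, the wider the margin that should be allowed around any single obtained IQ.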

The correlations between verbal and non-verbal parts, separately computed
in fourth-, fifth-, and sixth-grade groups, ranged from .66 to .68. Although
there is considerable overlap, it is apparent that somewhat different
functions are measured by verbal and non-verbal IQ's. A factor analysis of
the intercorrelations among the seven subtests in Level 3 yielded evidence of
a large verbal factor through the verbal tests and a large non-verbal factor
through the other three, thus further indicating that these two parts measure
distinct functions. In view of these findings, one wonders how well the
non-verbal IQ may predict such a highly verbal criterion as school
achievement in the case of non-readers and other educationally handicapped
children. Empirical data on the predictive validity of the non-verbal IQ
against academic criteria are needed to check on the effectiveness of the
tests for such purposes.

It is apparent that our knowledge of what these tests measure and how
their scores are to be interpreted would benefit from further empirical
validation. The chief strengths of the tests stem from the sound theoretical
rationale underlying choice of content, the size and representativeness of
the standardization sample, the high reliability of the IQ's, and the
generally superior quality of test-construction procedures followed in
developing the tests.


TESTS FOR HIGH SCHOOL STUDENTS AND 
UNSELECTED ADULTS 


All of the test series mentioned in the preceding section include levels
appropriate for testing high school students. However, it should be noted
that, as the upper limit of applicability of any test is approached, the test
may not be as satisfactory a measure of individual differences as it is
nearer the center of its range. Thus the Pintner, Kuhlmann-Anderson,
Henmon-Nelson, and Lorge-Thorndike, which extend through grade 12, may not
discriminate adequately among high school seniors. The Otis, California, and
SCAT, extending into the college level, would be expected to provide a more
adequate ceiling for testing the brighter high school seniors. On the other
hand, tests designed for the high school level are suitable for unselected
adults in the general population. Since the proportion of adults who have
attended college is still small, high school tests usually provide sufficient
ceiling for unselected adult groups, while tests constructed for college
students may be too difficult and would probably fail to differentiate at the
lower end of the distribution.

A widely used and carefully constructed test prepared specifically for the
high school level is the Terman-McNemar Test of Mental Ability (28), which in
1941 replaced the earlier Terman Group Test. The Terman-McNemar Test is
designed primarily for grades 7 through 12, although the authors report that
it may also be used in grade 6 and in the first year of college. It is
predominantly a measure of verbal comprehension, consisting of the following
seven subtests: Information, Synonyms, Logical Selection, Classification,
Analogies, Opposites, and Best Answer. Two numerical tests that had been
included in the earlier form were eliminated from the revised form in order
to make the test more homogeneous and the scores less ambiguous. The specific
instructions and sample items employed with each of the subtests of the
Terman-McNemar Test are reproduced in Figure 34. These items are not scored
and are considerably easier than those occurring in the test proper.

The Terman-McNemar Test provides two equivalent forms, available in both
hand-scored and machine-scored editions. Each form requires approximately
fifty minutes to administer and is described in the manual as being
essentially a power test, the time limits allowed for each subtest being
adequate to enable most subjects to attain their maximum score. Norms were
established through a carefully conducted, nationwide testing program,
involving 200 communities in 37 states. Scores can be expressed in terms of
percentiles, mental ages, and deviation IQ's with an SD of 16 points. The
last type of score is, of course, the soundest of the three measures and is
to be preferred for most purposes.

The reliability coefficient of the total test, adjusted for a single age
level, was found to be .96 by both split-half and parallel-form techniques. A
correlation of .91 is reported between the present test and the earlier
Terman Group Test. Individual scores on the two tests are not, however,
directly com-
TEST 1. INFORMATION

Mark the answer space which has the same number as the word that makes the sentence TRUE.

SAMPLE. Our first President was [answer options illegible]

TEST 2. SYNONYMS

Mark the answer space which has the same number as the word which has the SAME or most nearly
the same meaning as the beginning word of each line.

SAMPLE. correct    1 neat 2 fair 3 right 4 poor 5 good

TEST 3. LOGICAL SELECTION

Mark the answer space which has the same number as the word which tells what the thing ALWAYS
has or ALWAYS involves.

SAMPLE. A cat always has [answer options partly illegible; "milk" and "4 mouse" survive]

TEST 4. CLASSIFICATION

In each line below, four of the words belong together. Pick out the ONE WORD which does not
belong with the others, and mark the answer space bearing its number.

SAMPLES. 1 dog 2 cat 3 horse 4 chicken 5 cow
         6 hop 7 run 8 stand 9 skip

TEST 5. ANALOGIES

Study the samples carefully.

SAMPLES. Ear is to hear as eye is to
         1 cry 2 glasses 3 spy 4 wink 5 see
         Hat is to head as shoe is to
         6 arm 7 leg 8 foot 9 fit 10 glove
DO THEM ALL LIKE THE SAMPLES.

TEST 6. OPPOSITES

Mark the answer space which has the same number as the word which is OPPOSITE, or most nearly
opposite, in meaning to the beginning word of each line.

SAMPLE. north    1 hot 2 east 3 west 4 down 5 south

TEST 7. BEST ANSWER

Read each statement and mark the answer space which has the same number as the answer which you
think is BEST.

SAMPLE. We should not put a burning match in the wastebasket because
1 Matches cost money.    2 We might need a match later.
3 It might go out.       4 It might start a fire.

Fig. 34. Sample Items from Terman-McNemar Test of Mental Ability. (Copyright
by World Book Company.)


parable, because of differences in test content, standardization sample, and
method of computing IQ. The principal evidence for validity derives from the
item analysis, which was conducted on a total of 1200 pupils in grades 7, 9,
and 11. One criterion for item selection was grade differentiation, or
increase in percentage of subjects passing an item from grade 7 to 9, and
from grade 9 to 11. The other criterion was the correlation between each item
and total score on the entire test. No other statistical data on validity are
reported in the manual. It is pointed out, however, that during the many
years when the earlier form was widely used in high schools, close
correspondence was observed between test scores and indices of academic
achievement, such as graduation with honors.
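
The two item-selection criteria described for the Terman-McNemar item analysis can be sketched in a few lines. This is only an illustration under stated assumptions: items are scored 0 or 1, "grade differentiation" is read as a strictly increasing percentage passing across grades 7, 9, and 11, and the item-total relation is an ordinary product-moment correlation. The function names and data are hypothetical; the source does not give the manual's actual computing procedure.

```python
from statistics import mean, pstdev

def shows_grade_differentiation(pass_rates):
    """True if the proportion passing rises at every successive grade.

    pass_rates lists proportions in ascending grade order (e.g. 7, 9, 11).
    """
    return all(later > earlier
               for earlier, later in zip(pass_rates, pass_rates[1:]))

def item_total_correlation(item_scores, total_scores):
    """Product-moment correlation between a 0/1 item and total test score."""
    item_mean, total_mean = mean(item_scores), mean(total_scores)
    item_sd, total_sd = pstdev(item_scores), pstdev(total_scores)
    covariance = mean([(i - item_mean) * (t - total_mean)
                       for i, t in zip(item_scores, total_scores)])
    return covariance / (item_sd * total_sd)

# Hypothetical item: proportions passing in grades 7, 9, and 11.
print(shows_grade_differentiation([0.35, 0.52, 0.70]))

# Hypothetical scores of four pupils on one item and on the whole test.
print(round(item_total_correlation([0, 0, 1, 1], [1, 2, 3, 4]), 2))
```

An item would be retained under this reading only if it satisfied both checks. Note that both criteria are internal to the test, which is why the text treats them as limited evidence of validity.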

The two principal tests designed for general screening and placement of U.S.
military personnel in World Wars I and II also belong in the present
category. Both tests have subsequently appeared in civilian editions suitable
for high school students and for unselected adults. The original military
forms of the Army Alpha (cf. 42, Part I, Ch. 1-4) were validated in terms of
such criteria as amount of schooling and ratings of intelligence by officers.
In the development of this test, preliminary forms were administered, not
only to military personnel, but also to a number of other groups, including
college students, school children, and institutionalized mental defectives.
Several other criteria were employed in such preliminary testing. For
example, the criterion of contrasted groups was applied by comparing the
distributions of Alpha scores obtained by college students, officers,
enlisted men, and mentally defective adults. For school children, Alpha
scores were checked against such criteria as Stanford-Binet MA, chronological
age, grade, school marks, and teachers’ ratings of pupils’ intelligence.

Most of the indices of validity showed the Alpha to correlate fairly well
with the criteria employed. For example, a correlation of .82 was found
between a preliminary form of Alpha and school grade within a group of
unselected 13-year-olds, and a correlation of .86 within a group of
unselected 14-year-olds. Correlations with the Stanford-Binet ranged from .58
to .88. Correlations of the final form with ratings of intelligence by
officers ranged from .45 to .67, and correlations with amount of schooling
were in the .60's and .70's. Correlations of subtests with total scores and
intercorrelations among subtests were also employed in the final selection of
subtests.

After World War I, the Army Alpha was released for general use. Several
revisions were subsequently developed for civilian purposes and were widely
administered, especially in testing applicants for industrial jobs (cf. 11).
One of the current adaptations of this test is the Modified Alpha
Examination, Form 9, developed by Wells (37). Commonly known as “Alpha 9,”
this test consists of four numerical and four verbal subtests, from which can
be obtained separate N and V scores, as well as a combined N + V score. These
subtests, together with their corresponding N or V designations, are listed
below:

A. Addition (N)                   E. Number series completion (N)
B. Following directions (V)       F. Disarranged sentences (V)
C. [illegible]                    G. Finding largest common divisor (N)
D. [illegible]                    H. Synonym-antonym (V)

Percentile norms are provided for boys and girls in each year of high school,
separate norms being given for N, V, and total scores. The manual also
includes supplementary norms based on smaller samples of seventh- and
eighth-grade pupils, on engineers employed in a single large airplane
company, and on men applying for executive positions. Total score
reliabilities of about [value illegible] were obtained when Alpha 9 was
correlated with comparable earlier forms. No evidence of validity is cited in
the manual other than correlations with other group intelligence tests. The
mass of data accumulated over the years with earlier forms, however,
contributes to the construct validation of the test.

During World War II, the Army General Classification Test (AGCT) was
developed to serve many of the functions for which the Army Alpha had been
used in the earlier war. The AGCT was administered to over ten million
inductees. In 1945, when this test was replaced by a revised edition, the
earlier form was released for civilian use (2). The AGCT contains an equal
number of vocabulary, arithmetic reasoning, and block-counting items. Some of
the easy block-counting items used for demonstration purposes are reproduced
in Figure 35. The inclusion of equal proportions of verbal, numerical, and
spatial content in this test reflects the influence of the intervening
research on factor analysis. The different types of items are arranged in
spiral-omnibus form, with blocks of 5 or 10 items of each type following each
other in order of increasing difficulty. This layout permits the
administration of the entire test with a single time limit. The test proper
is preceded by three pages of practice items, including 10 items of each
type. The current civilian form is available in both a machine-scoring and a
self-scoring edition, the latter employing a pin-punch answer sheet.

AGCT norms were derived from data obtained during the military application
of the test. Both percentiles and AGCT standard scores are given. It will be
recalled that the latter were adjusted so as to yield a mean of 100 and an SD
of 20 points (cf. Ch. 4). Retest reliability is .82; split-half and
Kuder-Richardson reliabilities cluster about .95 (cf. 35). It is possible
that the latter value is spuriously high because of the influence of speed
upon test scores, although it is stated in the manual that the time limit
allowed is sufficient for most subjects to reach their upper limit of
difficulty.
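
Since AGCT standard scores use an SD of 20 while the deviation IQ's discussed earlier use an SD of 16, equal numerical values on the two scales do not mark equal relative standing. The sketch below re-expresses a score from one scale in the units of another by way of the common z score. The function name is hypothetical, and the conversion equates units only; it does not imply that scores from different tests, standardized on different samples, have the same meaning.

```python
def rescale(score, from_mean, from_sd, to_mean, to_sd):
    """Re-express a standard score on a scale with a different mean and SD."""
    z = (score - from_mean) / from_sd   # distance from the mean in SD units
    return to_mean + to_sd * z

# An AGCT standard score of 120 lies one SD above the mean (SD = 20).
# On a scale with mean 100 and SD 16, the same relative standing is:
print(rescale(120, from_mean=100, from_sd=20, to_mean=100, to_sd=16))
```

Applied in the other direction, a deviation IQ of 116 and an AGCT standard score of 120 both stand one SD above their respective means.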

Several measures of validity were found on the basis of the military samples
tested. AGCT scores correlated .73 with amount of schooling. Correlations
with many other tests are reported, some of which are extremely high. For
example, a correlation of .90 was found with Army Alpha, and one of .83 with
the Otis Higher Mental Ability Examination. Further data on validity were
provided by correlations with performance in military training schools for
various occupational specialties, such as that of clerk, radio operator, and
mechanic, as well as by correlations with performance in officer candidate
schools. The median AGCT scores of men who had been employed in different
civilian occupations are likewise given.


[Block-counting drawings, each captioned "How many blocks?", not reproduced.]

Fig. 35. Sample Block-Counting Items from the Army General Classification Test.
(Reproduced by permission of Science Research Associates.)

Mention should also be made of the subsequently developed Armed Forces
Qualification Test (AFQT), which was prepared cooperatively by all the armed
services for the purpose of providing each service with an equitable
distribution of ability in its quota of recruits (5, 6, 14, 36). The AFQT
includes vocabulary, arithmetic reasoning, and spatial relations items, the
last-named involving the recognition, perception, manipulation, and analysis
of relations in two and three dimensions. Items were selected on the basis of
difficulty level, as well as on the basis of their correlations with subtests
and total test scores. The AFQT has replaced the AGCT as well as other
screening tests formerly used by the various services. Following the general
preliminary screening by means of the AFQT, each service now administers its
own classification batteries, on the basis of which the inductees are
assigned to particular specialties within that service.


TESTS FOR COLLEGE STUDENTS AND SUPERIOR ADULTS 


A number of tests have been specially developed for use in the admission,
placement, and counseling of college students. An outstanding example is the
Scholastic Aptitude Test (SAT) of the College Entrance Examination Board (7,
12). Several new forms of this test are prepared each year, a different form
being employed in each administration. Separate scores are reported for the
Verbal and Mathematics sections of the test. A shorter comparable form, known
as the Preliminary SAT, has also been administered since 1959. Generally
taken at an earlier stage, this test provides a rough estimate of the high
school student's aptitude for college work and has been employed for
educational counseling and other special purposes. Both tests are restricted
to the testing program of the College Entrance Examination Board.

Another college-level test prepared for restricted use in a special program
is the Selective Service College Qualification Test (SSCQT). First
administered in 1951, this test was designed to aid in identifying college
students with high scholastic aptitude whose military service might
profitably be deferred in order to permit them to complete their education
(cf. 13). Utilizing a variety of item types, the test gives equal emphasis to
verbal ability and quantitative reasoning. In content, it provides a balanced
selection of material from the major areas of academic instruction.

A number of tests designed for college-bound high school seniors and for
college students are available for distribution to counselors and other
qualified persons. Among them may be mentioned the Ohio State University
Psychological Test (34), covering only verbal content; the College Placement
Test (8), providing verbal and quantitative scores; and the College
Qualification Tests (3), yielding scores in verbal and numerical aptitudes as
well as in broad areas of academic achievement.

For many years, one of the most widely used instruments for the testing of
entering college students was the American Council on Education Psychological
Examination for College Freshmen, commonly called the ACE. Available also in
a high school form, this test yielded a Linguistic (L) and a Quantitative (Q)
score. A large number of follow-up studies were conducted to determine the
validity of the ACE in predicting college grades. Although the results varied
widely with the level and heterogeneity of the sample and with the nature of
the courses, correlations with four-year grade-point averages clustered
around .45 (4; 26, pp. 120 ff.). In general, the validities were somewhat
lower than those reported for other, more recent instruments. Moreover, the L
and Q scores seem to be factorially complex and hence difficult to interpret.
Speed also plays an unduly prominent part in determining these scores.
Following the publication of the 1954 edition, the ACE, which since 1948 had
been prepared by the Cooperative Test Division of Educational Testing
Service, was discontinued and has been superseded by the Cooperative School
and College Ability Tests (SCAT).

As was noted in an earlier section of the chapter, SCAT (9) covers a range
from the fourth grade of elementary school through the college sophomore
year. At each of the five levels covering this range, the tests are available
in two equivalent forms, A and B. Oriented specifically toward the prediction
of academic achievement, all levels yield a verbal, a quantitative, and a
total score. The verbal score is based on two tests, Sentence Understanding
(I) and Word Meanings (III); the numerical score, on Numerical Computation
(II) and Numerical Problem Solving (IV). Figure 36 shows a sample item from
each of the four parts at Level 1, suitable for college freshmen and
sophomores. Administration of any one level requires approximately two class
periods.

In line with current trends in testing theory, SCAT undertakes to measure
“developed abilities.” This is simply an explicit admission of what is more
or less true of all intelligence tests, namely that test scores reflect the
nature and amount of schooling the individual has received rather than
measuring “capacity” independently of relevant prior experiences.
Accordingly, SCAT draws freely upon word knowledge and arithmetic processes
learned in the appropriate school grades. In this respect, SCAT does not
really differ from other intelligence tests, especially those designed for
the high school and college levels; it only makes overt a condition sometimes
unrecognized in other tests.

Verbal, quantitative, and total scores from all five SCAT levels are
expressed on a common scale which permits direct comparison from one level to
another. These scores can in turn be converted into percentiles for the
appropriate grade derived from a carefully chosen nationwide standardization
sample. A particularly desirable feature of SCAT scores is the provision of a
percentile band rather than a single percentile for each obtained score. This
percentile band, illustrated in Figure 37, covers a distance of approx-
Part I. Sentence Understanding: select the missing word by deciding which
one of the five words best fits in with the meaning of the sentence.

A knowledge of history is an antidote for sectionalism and narrow nationalism,
leading one instead to realize the eternal ( ) of peoples.

A. differences B. struggle C. self-consciousness D. interdependence
E. evolution

Part II. Numerical Computation: choose the correct answer, using scratch
paper if necessary.

2)3 pounds 2 ounces

A. pound 11/4 ounces
B. 1 pound 2 ounces
C. 1 pound 6 ounces
D. 1 pound 9 ounces
E. none of these

Part III. Word Meanings: pick the word or phrase whose meaning is closest to
the word in large letters.

retaliate

A. buy and sell profitably
B. whisper insults
C. repay evil with evil
D. recheck items in a list
E. return damaged merchandise

Part IV. Numerical Problem Solving: choose the correct answer, using scratch
paper if necessary.

If the average of numbers 1 through 9 is multiplied by 9, the result is

A. 36
B. 40.5
C. 45
D. 54
E. 90

Fig. 36. Typical Items from SCAT, Level 1, for Grades 13 and 14. (Reproduced
by permission of Cooperative Test Division, Educational Testing Service.)


imately one standard error of measurement on either side of the corresponding
percentile. Such a distance, technically known as a 68 per cent confidence
interval, represents the range of percentiles within which the individual's
"true" score will fall 68 out of 100 times (roughly a 2:1 chance). As
explained in Chapter 4, the error of measurement is a concrete way of taking
[SCAT student profile form (School and College Ability Tests), showing
converted scores and percentile bands, not reproduced.]

Fig. 37. SCAT Student Profile, Illustrating Use of Percentile Bands. (Reproduced
by permission of Cooperative Test Division, Educational Testing Service.)

the reliability of a test into account when interpreting an individual's
performance.

Thus if two students were to obtain total SCAT scores that fall in the
percentile bands 55-68 and 74-84, we could conclude with fair confidence that
the second actually excels the first and would continue to do so on a retest.
Percentile bands likewise help in comparing a single individual's relative
standing on verbal and quantitative parts of the test. From Figure 37, for
example, we would conclude that the student whose scores are plotted is not
significantly better in verbal than in quantitative abilities, because his
percentile bands for these two scores overlap.
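
The interpretive rule just described amounts to an interval-overlap check: two scores are treated as different only when their percentile bands do not overlap. A minimal sketch follows, with a hypothetical function name and with each band given as a pair of lower and upper percentiles. The first pair of bands is the one used in the text; the second pair is invented for illustration.

```python
def bands_overlap(band_a, band_b):
    """True if two percentile bands (low, high) share any percentile."""
    low_a, high_a = band_a
    low_b, high_b = band_b
    return low_a <= high_b and low_b <= high_a

# The two students discussed in the text: bands 55-68 and 74-84.
print(bands_overlap((55, 68), (74, 84)))   # False: a dependable difference

# Hypothetical verbal and quantitative bands for a single student.
print(bands_overlap((60, 75), (70, 82)))   # True: no dependable difference
```

Treating overlapping bands as "no dependable difference" is a conservative convention; it guards against reading meaning into gaps that are within the error of measurement.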

Reliability coefficients for verbal, quantitative, and total scores were sepa- 
rately computed within single grade groups by the Kuder-Richardson tech- 
nique (cf. Ch. 5). It will be recalled that this is a measure of interitem con- 
sistency within a single form administered once. The reported reliabilities are 
uniformly high. For the separate grade groups investigated, from grade 319 
grade 13, total score reliabilities are all .95 or .96; verbal and quantitative 
relabilities vary between .88 and .94. These reliabilities may be spuriously 
high because the tests are somewhat speeded. It is reported that, in some of 
the samples tested, the numbers of subjects completing all items were as low as 
65 per cent and 80 per cent for the two verbal tests and as low as 48 per 
cent and 60 per cent for the two quantitative tests. Under these circumstances, 
equivalent-form reliability would seem more appropriate. When the neces- 
sary data have been accumulated, these equivalent-form reliabilities will pe 
reported in supplements to the manual, to be issued from time to time. It is 
possible that the computed values of the errors of measurement —and hence 
the percentile bands—may require revision when such equivalent-form reli- 
abilities become available. 
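
The dependence of the error of measurement on the reliability coefficient can
be sketched in a few lines of Python. The sketch assumes Kuder-Richardson
formula 20 (the text does not say which Kuder-Richardson formula was used) and
the usual relation SEM = SD * sqrt(1 - r). The response matrix, the standard
deviation of 10, and the lower reliability of .91 are made-up illustrative
values, not SCAT data.

```python
from math import sqrt
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for rows of 0/1 item scores.

    responses: one list per examinee, one 0/1 entry per item.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)  # population variance of total scores
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_examinees  # proportion passing
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r)."""
    return sd * sqrt(1 - reliability)


# Made-up responses: 5 examinees, 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 3))

# A lower reliability estimate enlarges the error of measurement, and any
# percentile band built from it would widen accordingly.
for r in (0.96, 0.91):
    print(r, round(standard_error_of_measurement(10.0, r), 2))
```

With a standard deviation of 10, the error of measurement in this sketch rises
from about 2 score points at a reliability of .96 to about 3 at .91, which is
why a revised reliability estimate could require revised bands.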

A variety of sources provide information about what SCAT measures.
Through preliminary experimentation, the four subtests illustrated in Figure
36 were chosen from nine subtests, each representing a different item type
designed to sample abilities required for academic success. This selection was
based on the correlations of each subtest with total grade averages, English
grades, and mathematics grades in groups of ninth- and twelfth-grade students.
Intercorrelations among the subtests were also considered, in order
to maximize differences between verbal and quantitative measures. Items for
the four types of subtests finally chosen were next selected on the basis of
item-subtest correlations and appropriateness of difficulty level. This item
analysis was conducted on large samples of students representative of the
population for which the test was being developed. In the final forms,
correlations between verbal and quantitative scores dropped from .71 in the fifth
grade to .53 in the thirteenth. Such evidence for increasing differentiation of
abilities with increasing age and educational level is consistent with findings
on the organization of abilities (cf. 1, Ch. 11).
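
The two item statistics mentioned above, difficulty level and item-subtest
correlation, can be illustrated with a short Python sketch. It assumes that
difficulty is expressed as the proportion of examinees passing the item and
that the item-subtest correlation is an ordinary Pearson correlation between
the 0/1 item score and the subtest total. Whether the SCAT analysis removed the
item from the subtest total is not stated, so the sketch leaves it in. All data
and function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_statistics(item_scores, subtest_totals):
    """Return (difficulty, item-subtest correlation) for one 0/1-scored item."""
    difficulty = sum(item_scores) / len(item_scores)  # proportion passing
    discrimination = pearson_r(item_scores, subtest_totals)
    return difficulty, discrimination


# Made-up data: five examinees, one item, and their subtest totals.
item_scores = [1, 1, 1, 0, 0]
subtest_totals = [4, 3, 2, 1, 0]
difficulty, discrimination = item_statistics(item_scores, subtest_totals)
print(difficulty, round(discrimination, 3))
```

An item would be retained, under this reading, when its difficulty suits the
intended level and its correlation with the subtest is acceptably high; the
actual cutoffs used for SCAT are not given in the text.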

Special interest attaches to the correlation of SCAT with the ACE, which it 
was designed to replace. Total scores on the two tests correlated .88 and .87 
in high school and college samples, respectively; verbal-linguistic scores corre- 
lated .89 and .85; and quantitative scores correlated .75 and .76. Correla- 
tions with the College Board SAT are also of interest because of the possibil- 
ity of predicting SAT scores from earlier SCAT scores, as well as the use of 
SCAT scores in counseling college-bound high school students. Correlations 
between these two tests were also found to be in the high .70's and .80's. 
Charts have been prepared for predicting a student's SAT score from his 
eleventh- or twelfth-grade SCAT score. 

In view of the stated purpose for which SCAT was developed, its predictive
validity against academic achievement is of prime relevance. A limited number
of studies at elementary, high school, and college levels have so far
yielded correlations in the .50's and .60's with grades and correlations in the
.80's with achievement test scores. Intervals between test and criterion ranged
from a semester to a year. Long-range validation studies at all levels are in
progress and will be reported in Supplements to the Technical Report
distributed by the test publishers. The effectiveness of the verbal and
quantitative scores as differential predictors of grades in specific courses remains
uncertain. English grades tend to correlate higher with verbal than with
quantitative scores, and mathematics grades higher with quantitative than with
verbal, but the differences are small and not entirely consistent. Moreover,
total scores often yield the highest correlations with all types of courses.

On the whole, SCAT is an excellently planned and well-constructed test.
Among its chief assets are the carefully chosen standardization samples, the
uniform score scale for all levels, the introduction of percentile bands, and
the promising evidence of predictive validity. Major gaps to be filled include
equivalent-form reliabilities, in place of the questionable single-form
reliabilities, and more validity data, especially for long-range predictions.

The practice of testing applicants for admission to college has subsequently
been extended to include graduate and professional schools. Most of
the tests designed for this purpose, however, represent a combination of general
intelligence and achievement tests. A well-known example is the Graduate
Record Examination (GRE), administered to applicants or entering students
in a large number of graduate schools. Although one portion of this
test corresponds closely to the usual scholastic aptitude or intelligence test for
superior adults, other parts measure the student's mastery of specific
subject-matter areas. Similar tests or batteries of tests have been assembled for the
selection of applicants to professional schools, such as schools of medicine,
law, dentistry, and the like. This type of test will be covered in Chapter 17,
following the discussion of achievement tests.

One test designed for the selection of graduate students that can be prop- 
erly classified in the present chapter, however, is the Miller Analogies Test 
(19). Consisting of complex analogies items whose subject matter is drawn 
from many academic fields, this test has an unusually high ceiling. Although 
a fifty-minute time limit is imposed, the test is primarily a power test. The 
Miller Analogies Test was first developed for use at the University of Minne- 
sota, but later forms were made available to other graduate schools. Its ad- 
ministration, however, is restricted to licensed centers, and rigid controls are 
exercised over the test materials in order to prevent coaching and protect 
the security of the test. One form is available for testing high-level industrial 
personnel. 

Percentile norms on the Miller Analogies Test are given for several groups
of graduate students in different fields and different universities, as well as 
for a few professional school groups. Marked variations in test performance 
are found among these different samples. The median of one group, for 
example, corresponds to the 90th percentile of another. Odd-even relia- 
bility coefficients of .92 to .94 were found with different groups of gradu- 
ate students, and alternate-form reliabilities ranged from .85 to .89. Correla- 
tions with graduate course grades and with performance on comprehensive 
examinations vary widely in different institutions and departments, but more 
than half fall at or above .40, several being in the .60's and .70’s. In general, 
its validity appears to be as good as that of other longer tests, or better. Cor- 
relations in the .70’s and .80’s have been reported between the Miller Analo- 
gies Test and the various parts of the GRE, a test that requires several hours 
to administer.
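
An odd-even reliability coefficient of the kind cited above can be sketched as
follows. The sketch correlates scores on the odd-numbered and even-numbered
items and then applies the Spearman-Brown formula to estimate the reliability
of the full-length test. The text does not state that this correction was
applied to the Miller Analogies figures, so that step is an assumption, and the
response data are made up.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


def odd_even_reliability(responses):
    """Split-half reliability from rows of item scores, stepped up to full length."""
    odd_scores = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_scores, even_scores))


# Made-up responses: 5 examinees, 4 items scored 0/1.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(odd_even_reliability(responses), 2))
```

Because both halves are taken at a single sitting, such a coefficient reflects
consistency within one administration, which helps explain why it runs higher
than the alternate-form values reported for the same test.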

Another test that provides sufficient ceiling for the examination of highly
superior adults is the Concept Mastery Test (27). Originating as a by-product
of Terman’s extensive longitudinal study of gifted children, Form A
of the Concept Mastery Test was developed for testing the intelligence of the
gifted group in early maturity (29). For a still later follow-up, when the
gifted subjects were in their mid-forties, Form T was used (30). This
form, which is somewhat easier than Form A, was subsequently released
for more general use. The Concept Mastery Test consists of both analogies
and synonym-antonym (same-opposite) items. Like the Miller Analogies
Test, it draws on concepts from many fields, including physical and biological
sciences, mathematics, history, literature, music, and others. Although
predominantly verbal, the test incorporates some numerical content in the
analogies items.

Percentile norms are provided for graduate students, college seniors applying
for fellowships, and selected adult groups, but the samples are small and
such norms must be regarded as tentative. Alternate-form reliabilities range
from .86 to .94. Scores show consistent rise with increasing educational level 
and moderately high correlations with other intelligence tests in superior adult 
groups. Available evidence of predictive validity, however, is meager. 

Mention may also be made of the CAVD, an early test developed by E. L.
Thorndike and his associates at the Institute of Educational Research,
Teachers College, Columbia University (31, 32). This test derives its name from
its four constituent parts: Completion, Arithmetic, Vocabulary, and Directions.
The CAVD is highly verbal in content, even the arithmetic problems
often demanding a high level of reading comprehension. Although designed
to cover a range from a mental age of 3 to superior adult, this test has been
used chiefly at its upper levels, which are suitable for college and graduate
students. The CAVD is a carefully constructed, pure power test with an
unusually high ceiling. It has proved helpful in identifying college and graduate
students who are handicapped on speed tests, such as the slow readers or
the overcautious. Intercorrelations among its four alternate forms range from
.88 to .93.


ABBREVIATED TESTS FOR INDUSTRIAL SCREENING 


A number of intelligence tests have been specially developed for the rapid,
preliminary screening of industrial personnel. Several of these tests represent
abridged versions of earlier tests. Others have been specifically constructed
for the purpose, a few introducing interesting innovations in testing
procedure. In all these tests, administration and scoring are simplified and
streamlined, and an effort is usually made to give the content face validity in an
industrial setting.

It should be clearly recognized that general screening tests may have fairly
high validity for some jobs and little or no validity for others. The type of
behavior sampled by these tests is undoubtedly far more relevant to some
types of jobs than to others. Jobs cannot simply be put into a hierarchy in
terms of the amount of “intelligence” required, because the type of
“intelligence” needed for different jobs varies. For many occupations, especially
those requiring mechanical skills, tests of special aptitudes will serve as better
predictors of achievement than will the general intelligence tests. This point
is not always adequately stressed in the test manuals. In fact, some manuals
tend to create the erroneous impression that the general screening test
can be used to predict success in almost every type of industrial work.

An early test that has been widely used in personnel screening is the Otis 
Self-Administering Test of Mental Ability: Higher Examination (20), which
was used as a basis for developing the highest level (Gamma) of the Otis
Quick-Scoring Mental Ability Tests. In industry, this test has been used in
screening applicants for such varied jobs as those of clerks, calculating-machine
operators, assembly-line workers, and foremen and other supervisory
personnel. Dorcus and Jones (11) cite 36 validation studies in which the Otis
test was checked against an industrial criterion. Not all of these studies
yielded significant validity coefficients, of course, but many of them did. In
semiskilled jobs, the Otis test correlates moderately well with success in learning
the job and ease of initial adaptation, but not with subsequent job
achievement (26). This would be expected for jobs that are largely routine, 
once they are learned. Also, for high-level professional personnel, who 
represent a select group in terms of academic achievement, correlations be- 
tween Otis scores and criteria of job success are usually negligible, since this 
test does not discriminate adequately at the upper levels (cf. 11). 

The Wonderlic Personnel Test (41), available in four forms, is an adapta- 
tion and abridgment of the Otis Self-Administering Test. Despite its time 
limit of only twelve minutes, it yields correlations of .81 to .87 with the
original, longer Otis test. Parallel-form reliability coefficients between .82
and .94 are reported. The manual provides percentile norms on large industrial
samples, totaling approximately 37,000 cases. Correlations with industrial criteria
vary widely with the nature of the job, the sample tested, and the type of 
criterion measure employed (cf. 11). The test appears to have the highest 
validity in the selection of clerical workers. 

Another abbreviated adaptation of an earlier, widely used test is the 
Thurstone Test of Mental Alertness (33). This test is available in two equivalent
forms, each consisting of 126 multiple-choice items of different types
arranged in spiral-omnibus order. With a time limit of twenty minutes, the
test yields Q and L scores, based on the quantitative and linguistic items,
respectively, as well as a total score. Separate percentile norms for these
three scores are provided for various educational and occupational samples,
although some of the groups employed for this purpose are small. Alternate-form
reliability was found to be close to .90. Validity is reported principally
in terms of relationships with supervisors’ ratings in a number of executive,
sales, and clerical groups. Many of these samples are small, and the relations,
although significant, provide meager evidence of validity.
A more recently developed test is the Wesman Personnel Classification
Test (38). Like most current intelligence tests, it yields Verbal, Numerical,
and Total scores. The first is based on an eighteen-minute verbal analogies
test, in which each item contains two blanks, as illustrated in Figure 38. The
Numerical score is derived from a ten-minute arithmetic computation test
whose items were designed so as to put a premium on ingenuity and ability
to perceive numerical relations. Two parallel forms are available. Percentile
norms on each of the three scores are reported for groups of students, job
applicants, and employees, each group including from 93 to 1476 cases.
Parallel-form reliability coefficients for V, N, and T scores fall mostly in the
.80's. Correlations of V and N scores vary from .25 to .57, indicating that
the overlap of the two parts is small enough to justify retention of separate
scores.


Example 3. ..... is to night as breakfast is to .....

1. flow 2. gentle 3. supper
A. include B. morning C. enjoy D. corner

Supper is to night as breakfast is to morning. So you should
have written 3B on the line at the right.


Fig. 38. Sample Analogies Items from the Wesman Personnel Classification Test.
(Reproduced by permission of The Psychological Corporation.)


Correlations of the Wesman test with the Otis and Wonderlic range from
.68 to .84. Mean scores on the test show progressive rise with increasing
educational and occupational level in the groups compared. Correlations with
criteria of vocational success, usually based on supervisors' ratings, range
from .29 to .62. From the nature of the items, as well as from the distribution
of scores reported for various groups, it appears that this test may be better
suited for higher-level than for lower-level personnel. It also seems likely that
the predominantly academic content of the items would not hold the interest
of lower-level job applicants and would lack face validity for them.

As a final illustration of rapid screening tests, we may consider the Oral 
Directions Test (17), in which both directions and test items are presented 
on either a phonograph or tape recording. This insures more uniformity
than would be possible through oral administration by different examiners.
Requiring a total time of fifteen minutes, this test has a split-half reliability of
approximately .90. Percentile norms are reported for various industrial samples
and educational groups. Correlations of an earlier, longer form with the
Alpha 9 and Otis tests clustered around .80. Older persons may be somewhat
handicapped on the Oral Directions Test because of its dependence on audi- 
tory discrimination and speed. It is rather heavily weighted with perceptual 
and spatial items, and also depends to a considerable extent upon immediate 
memory for auditory instructions. Available data suggest that it discriminates 
somewhat better at the lower intellectual levels and may be particularly use- 
ful in screening applicants for such jobs as general laborer, maintenance and 
service worker, and messenger. In its later, shortened version, the Oral 
Directions Test is part of a short, low-level intelligence battery, which also 
includes a five-minute Verbal Test (39) and a twenty-minute Numerical 
Test (10). A Spanish version of the Oral Directions Test has also been pre- 
pared for use with Puerto Ricans and other Latin American groups (cf. 24). 
CHAPTER 10


Performance and Non-Language Tests 


The tests brought together in this chapter include both individual and group
scales. They have been developed primarily for use with subjects who cannot
properly or adequately be measured with such instruments as the Binet scales
or the group tests considered in the preceding chapter. Among the special
groups for which performance and non-language tests are required may be
mentioned the deaf, the speech defective, the illiterate, and the
foreign-speaking.

Owing to their general deficiency in linguistic development, the deaf are
handicapped on verbal tests, even when the verbal content is visually
presented (cf. 5, pp. 145-147; 61). Similarly, the child who has a serious speech
defect or whose speech development is retarded for any reason will be unable
to take many of the tests on such a scale as the Stanford-Binet, which
require oral replies. Such a child, moreover, may be too young or too retarded
educationally to take a written test. The illiterate of any age cannot,
of course, take the usual individual or group test that calls for a certain
amount of reading and writing. Children with special reading disabilities
would also fall into this category.

Persons with a foreign-language background may likewise experience
special difficulties in taking the common, predominantly verbal intelligence
test. Although the bilingual individual may have sufficient mastery of English
to communicate on ordinary matters and even to attend an English-speaking
school, he may be handicapped when taking a verbal test in English. Such
a person may lack the monolingual's vocabulary range, verbal fluency, or
facility in handling verbal relations in English. Studies on American-born
school children of foreign parentage, for example, often indicate a special
deficiency on verbal tests.

The effects of bilingualism are varied and complex. They cannot be
adequately summarized in any simple generalization (cf. 5, pp. 558-561;
25). Under certain conditions, intellectual development may be aided by
bilingualism; under other conditions, it may be seriously retarded. Emotional
as well as intellectual factors probably contribute to the specific effects of 
bilingualism in particular cases. In testing any bilingual groups, such as im- 
migrants or the children of immigrants, however, the possible influence of 
language handicap on test performance must be given serious consideration. 
It cannot be generally assumed that such individuals can be adequately meas- 
ured with a verbal test, despite their apparent mastery of English. 

Performance and non-language tests have also been developed for use in 
intercultural comparisons. This application would include not only the testing 
of persons in different nations and in preliterate cultures, but also the com- 
parison of subcultures within a single country. For example, urban and rural 
groups, as well as individuals reared in different socioeconomic levels, may 
not have been equally exposed to the sort of educational experiences pre- 
supposed by most verbal tests. For these reasons, it has been argued that the 
performance or non-language type of test may be more suitable for compari- 
sons among these varied groups, since such tests utilize content that is more 
nearly common to the different groups. The implications of this contention 
will be examined more fully in a later section of the chapter. 

Another purpose for which performance and non-language tests are em- 
ployed is to supplement the usual type of intelligence test. Certain individuals 
may score poorly on verbal tests for special reasons. Thus the shy, inarticu- 
late child, or the child who feels discouraged when confronted with verbal 
tests because of his repeated school failures, may perform more satisfactorily 
on less academic tasks. On the other hand, the “verbalist” type of individual 
may obtain a deceptively high score on certain verbal tests, although his un- 
derstanding of most problems may be very superficial and his practical judg- 
ment may be seriously deficient. It is now generally recognized that perform- 
ance or non-language tests are not simply a substitute for verbal tests. Each 
type of test taps somewhat different abilities. Together they provide a more 
complete picture of the individual and serve as mutual correctives in the 
evaluation of his test performance. 

The first section of this chapter will be concerned with performance tests. 
Such tests involve largely the manipulation of objects, rather than oral or 
written responses. All are designed essentially for individual administration. 
Non-language group tests will be covered in the second section. Although 
employing paper and pencil, these tests require no knowledge of reading or
writing and may be administered without spoken language if necessary. Some
of the tests classified in this category do require some oral instructions, but
such instructions are sufficiently simple and general that they could presumably
be translated into another language without appreciably altering the
difficulty of the test. In any event, an attempt is made to reduce and simplify
oral directions and to rely upon demonstrations and sample items. 

Another type of test, to be considered in the third section, has been
designated as “culture-free.” Such tests have been developed primarily for
cross-cultural testing. Like the previously mentioned tests, they are either
completely non-language or employ translatable instructions. The test items,
besides being non-verbal, are designed so as to be relatively universal in
content and to minimize the specific influence of any one culture. Although
constructed especially for use in cross-cultural studies, these tests are also
applicable in other situations for which non-language tests are desirable. A final
section will be devoted to the problem of testing the physically handicapped.
Consideration will be given to the applicability of previously discussed tests to
individuals with sensory or motor handicaps. Reference will also be made to
adaptations of existing tests and to specially developed tests available for such
purposes.


PERFORMANCE TESTS 

One of the earliest performance tests was the formboard developed by
Seguin for use with mental defectives. Originally devised in connection with
Seguin’s program for the sensory and motor training of the mentally deficient,
this formboard was subsequently incorporated into a number of performance
scales. A photograph of the Seguin Form Board, as used in a current series
of performance tests, will be found in Figure 39. In administering this test,
the examiner removes the ten pieces and stacks them as shown in the picture,
instructing the subject to put them back as fast as he can. Three trials are
allowed, the subject's score being the time required for the fastest of the three.
The Seguin Form Board is one of the simplest formboards employed in
performance scales, being suitable for relatively low mental ages. Many other
formboards of increasing complexity have subsequently been developed for
higher levels.


Fig. 39. Seguin Form Board. (From Arthur Point Scale of Performance Tests,
Rev. Form II; courtesy The Psychological Corporation.)

A number of other performance tests were developed to meet special test- 
ing requirements, following the early application of the Binet scales. In his 
pioneer psychological work with delinquent children, Healy realized the need 
for performance tests to supplement the more verbal type of task which pre- 
dominated in the Binet scales. As a result, the Healy-Fernald test series (41) 
was assembled in 1911. This series of 23 tests embraced a wide variety of 
tasks, such as reading, arithmetic, naming opposites of words, code-learning, 
information questions, memory for a picture, construction puzzles, picture 
completion, puzzle box, and others. The tests were not combined to yield a 
single score, as in the Binet, but were analyzed qualitatively in an effort to 
obtain a picture of the child's strengths and weaknesses. 


Several of the performance tests from the Healy-Fernald series have found
their way into later scales. An example is the Healy Picture Completion Test
I, shown in Figure 40. This picture depicts a rural scene from which ten
small squares have been cut out. The square that best completes each part
of the picture is to be selected from a large number of pieces and inserted by
the subject. In Figure 41 will be seen the Healy Picture Completion Test II,
which portrays successive scenes from a typical day in a schoolboy's life.
From each scene, a square piece has been cut out, which the subject must
select from those in the box and insert in the appropriate place. In each case,
the selection of the correct piece depends upon an understanding of the event
represented in the scene.


Fig. 40. Healy Picture Completion Test I. (From Pintner-Paterson Performance
Scale; courtesy C. H. Stoelting Company.)

Another early series of performance tests was developed by Knox (50)
for testing foreign-speaking immigrants upon arrival in the United States. All
tests in this series were of the performance type and were administered without
the use of language. Among the tests included in the series was a set of
formboards of increasing difficulty, such as the one illustrated in Figure 42,
as well as the Ship Test and the Knox Cube Test. In the Ship Test, ten
rectangular pieces are to be arranged within a wooden frame to make a picture
of a ship at sea. Undoubtedly this particular picture was chosen because of
its appropriateness for testing immigrants who had just disembarked from
an ocean liner. The Knox Cube Test is essentially a test of immediate memory
for a series of movements. The examiner taps each of four cubes in a
predetermined order and then indicates that the subject is to do likewise.
The procedure is repeated with successive series of taps, increasing in length
and in complexity of sequence.


Fig. 41. Healy Picture Completion Test II. (From Arthur Point Scale of Performance
Tests, Rev. Form II; courtesy The Psychological Corporation.)


Fig. 42. Casuist Formboard. (From Pintner-Paterson Performance Scale; courtesy 
C. H. Stoelting Company.) 


The Pintner-Paterson Performance Scale (67) represented the first major 
attempt to develop a standardized series of performance tests with general 
norms. A number of tests included in this scale were taken from the work of
Seguin, Healy, Knox, and others, and have in turn been incorporated into 
later scales. The entire scale consisted of 15 tests, although for most testing 
purposes a shorter scale including the 10 most satisfactory tests has been em- 
ployed. Figures 39, 40, 42, 43, and 44 show 6 tests from this series, all of 
which have been included in a later scale. 

Although the Pintner-Paterson Scale represented a considerable advance 
over earlier performance tests with regard to scope of tasks, standardization 
of procedure, and size of normative samples, it still lagged far behind the 
test-construction standards set by the Stanford-Binet or by some of the group 
tests described in Chapter 9. Progress in the development of performance tests 
has been relatively slow. Such tests are still crude, in comparison with most 
verbal tests. The reliability of the Pintner-Paterson is considerably lower than 
that of most verbal scales. It should also be noted that correlations between
the Pintner-Paterson and such scales as the Stanford-Binet are fairly low,
when computed on relatively homogeneous age groups (cf., e.g., 57). 


Fig. 43. Mare and Foal Test. (From Pintner-Paterson Performance Scale; courtesy
C. H. Stoelting Company.)


The extent to which total scores on the Pintner-Paterson Scale depend
upon speed of performance has frequently been recognized as a weakness of
the scale. Certain types of individuals may become unduly disturbed by
the repeated emphasis upon speed. Moreover, cultures vary widely in the
degree to which they foster and encourage speed.
This fact was vividly brought out, for example, in a comparison of white,
Negro, and American Indian boys by means of the Pintner-Paterson (49).
Not only did the three groups differ much more in speed than in quality of
performance, but marked differences in speed were also found between
urban and rural as well as among other subsamples of the three ethnic
groups studied. Since speed plays a relatively minor part in the daily life of
the reservation Indian or the rural Southern Negro, it is difficult to convey
to such subjects the notion that they must hurry through the test as much as
possible.


A number of later performance scales have borrowed extensively from
the Pintner-Paterson. The Army Performance Scale, developed for individual
testing of recruits during World War I, consisted of 10 tests, several of
which were taken from the Pintner-Paterson series (84, 85). The Cornell- 
Coxe Performance Ability Scale (24), developed in 1934, utilized many of 
the tests from the Army Performance Scale, a few of which had also been 
part of the Pintner-Paterson. Although familiarity with these various scales 
is helpful in evaluating published studies that have employed such tests, 
most of these early scales have been largely replaced by more recently re- 
vised and restandardized tests, such as the Arthur Performance Scale, to be 
considered below. This scale, too, has been constructed primarily from ma- 
terials taken from earlier series. 


Fig. 44. Manikin Test. (From Pintner-Paterson Performance Scale; courtesy C. H.
Stoelting Company.)


Form I of the Arthur Performance Scale (8, 9), first released in 1930,
was based on a restandardization of eight of the Pintner-Paterson tests,
together with the Porteus Mazes and the Kohs Block Design. All of these tests
were restandardized on the same sample of approximately eleven hundred
school children between the ages of 5 and 15, about one hundred at each
age level. Both the Porteus Mazes and the Kohs Block Design had been
widely used as single tests, prior to their incorporation into the Arthur Scale,
and are continuing to be so employed. The tests included in the Arthur Scale
are listed below:

1. Knox Cube: See earlier description.

2. Seguin Form Board: See earlier description and Figure 39.

3. Two-Figure Form Board: A more difficult formboard, in which a square and
a cross are each divided into four pieces to be fitted together. This test is not
scored in the Arthur Scale, being used only to introduce the subject to the
“puzzle-test procedure” to be followed in later tests in the series.

4. Casuist Form Board: This formboard is rendered more difficult by the close
similarity of various pieces, which necessitates finer discriminations. See
Figure 42.

5. Manikin: A crude wooden figure of a man is to be assembled from arms, legs,
head, and trunk. See Figure 44.
Feature Profile: Given wooden pieces are to be assembled to form a face in
profile.

6. Mare and Foal: A relatively easy picture-completion test in which each piece,
being of a different shape, fits only in its proper recess. No more pieces are
provided than are actually needed. See Figure 43.

7. Healy Picture Completion I: See earlier description and Figure 40.

8. Porteus Mazes: Described below.

9. Kohs Block Design: Described below.


The raw score on each test is translated into a point score that weights each 
test in proportion to its ability to discriminate between successive age levels. 
Thus tests that show marked progress between successive age levels receive 
higher weights than those exhibiting smaller age differences in performance. 
The sum of the point scores is converted into an MA, from which an IQ is 
computed by the traditional ratio method. 
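
The ratio method referred to here can be illustrated with a minimal sketch, 
assuming the usual definition IQ = 100 x MA / CA. The function name and the 
sample ages are hypothetical, and the conversion of point scores into an MA 
(which depends on the published Arthur tables) is not reproduced. 

```python
def ratio_iq(mental_age_months: float, chronological_age_months: float) -> float:
    """Traditional ratio IQ: 100 * MA / CA, both ages in the same unit."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_months / chronological_age_months


# Hypothetical child: MA of 9 years 0 months, CA of 10 years 0 months.
print(ratio_iq(9 * 12, 10 * 12))  # 90.0
```

Because the result is a simple quotient, its meaning depends on how MA 
grows with age in the standardization sample, a point the text returns to 
when it discusses ratio IQ's on other scales. 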

The Porteus Maze Tests (68, 69, 70), first developed by Porteus in 1924, 
consist of a series of printed line mazes, steeply graded in difficulty. The 
mazes can be administered with no verbal instructions by using the easier 
mazes for demonstration purposes. They range from the 3-year to the adult 
level. The standard procedure is to have the subject trace with a pencil the 
shortest path from the entrance to the exit of the maze, without ever lifting 
the pencil from the paper. There is no time limit, and subjects are not hur- 
ried in any way. As soon as an “error” is made, by either crossing a line or 
entering a wrong pathway, the subject is stopped and given a second trial on 
an identical maze. If an error is made on the second trial, a failure is re- 
corded for that level. At the higher levels, four trials are allowed. Scoring 
takes into account the trial in which each maze was successfully completed. 
No spontaneous correction of errors is permitted, the maze being removed as 
soon as any error is made. In his presentation of this test series, Porteus has 
repeatedly described it as a measure of foresight and planning capacity. He 
maintains that it excels verbal tests in measuring those aspects of intelligence 
most important in practical social sufficiency. The Porteus Mazes have been 
used in investigations on a wide variety of subjects, including normals, men- 
tal defectives, patients with organic brain damage, delinquents, and many 
different ethnic and cultural groups. 

In the Kohs Block Design (51), the subject is presented with a set of 
identical 1-inch cubes, whose six sides are painted red, blue, yellow, white, 
yellow-and-blue, and red-and-white, respectively. Colored designs are pre- 
sented on each of 17 test cards, the subject being required to reproduce each 
design by assembling the proper blocks. The number of blocks required 
varies from 4 to 16. Each design has a time limit, extra credit being given 
for completing it in less time. 

In 1947, a Revised Form II of the Arthur Performance Scale was released 
(10). This form was developed primarily as an alternate for Form I, to be 
used in retesting. Norms on Form II were derived on 968 pupils from the 
same “middle-class American district" used in standardizing Form I. Special 
efforts were made in the development of Revised Form II to prepare direc- 
tions suitable for deaf children; the use of language is thus reduced to a 
minimum in its administration. This form consists of five tests, including one 
newly developed test and revised versions of the Knox Cube, Seguin Form 
Board (Fig. 39), Porteus Mazes, and Healy Picture Completion Test I 
(Fig. 41). Most of the revisions concern the instructions or represent minor 
changes in the specific test materials. The new test is the Arthur Stencil De- 
sign Test I, which is pictured in Figure 45. This test is similar to the Kohs 
Block Design in so far as the subject must reproduce designs of increasing 
complexity which are presented singly on cards. In the Stencil Design Test, 
however, the design is reproduced by superimposing cut-out stencils in differ- 
ent colors upon a solid card, several overlapping stencils being required for 
the more complex designs. 

The data on reliability and validity reported by Arthur are meager, al- 
though some pertinent information has been obtained by other investigators. 
A test-retest correlation of .85 was found when Form I was administered 
over a two-year interval to a small group of mentally defective boys (65). 
Considerable practice effect was found over this period. Such a result sug- 
gests that retests with the same form should be given only when the interval 
is very long. In reference to validity, the major criterion employed was that 
of age differentiation, a criterion on the basis of which tests were both selected 
and weighted. 


Fig. 45. Arthur Stencil Design Test I. (Courtesy The Psychological Corporation.) 


A number of correlations of the Arthur Scale with the Kuhlmann-Binet 
and with the Stanford-Binet have been reported, ranging from the .50’s to 
the low .80’s (34, 42, 82). Although high, most of these correlations indicate 
some differences in the information provided by the two types of tests. Some 
clinicians maintain that an individual's relative performance on the Stanford- 
Binet and Arthur scales may itself have diagnostic value as an index of 
emotional and social adjustment. Results on this point, however, are some- 
what inconsistent. In interpreting relative performance on these tests, other 
factors such as the individual's educational and cultural background must 
be taken into account. No one explanation can be found for all intra-indi- 
vidual differences between verbal and performance test scores. Attempts 
have also been made to look for diagnostic significance in the relative per- 
formance on different tests within the Arthur Scale. An additional point to 
consider in connection with such a procedure is the relatively low reliability 
of the separate tests, which makes the obtained differences in test scores of 
doubtful significance. 
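
The effect of low subtest reliability on score differences can be made 
concrete with a small sketch. It assumes the standard formula for the 
standard error of a difference between two scores expressed on the same 
scale, SE = SD * sqrt(2 - r_a - r_b). The SD and the two reliability 
coefficients below are hypothetical and are not taken from Arthur's data. 

```python
import math


def se_difference(sd: float, r_a: float, r_b: float) -> float:
    """Standard error of the difference between two scores on a common scale.

    sd  -- standard deviation shared by both score scales (assumption)
    r_a -- reliability coefficient of the first test
    r_b -- reliability coefficient of the second test
    """
    for r in (r_a, r_b):
        if not 0.0 <= r <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return sd * math.sqrt(2.0 - r_a - r_b)


# Hypothetical subtests with SD = 10 and reliabilities of .60 and .70.
se = se_difference(sd=10.0, r_a=0.60, r_b=0.70)
print(round(se, 2))         # 8.37
print(round(1.96 * se, 1))  # 16.4: difference needed at roughly the .05 level
```

With reliabilities this low, two subtest scores would have to differ by 
more than one and a half standard deviations before the difference could 
be taken seriously. 
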

Recent studies with both normal and mentally defective children indicate 
that MA’s on the Arthur Form I run significantly lower than on the Stanford- 
Binet (31, 36, 60). In particular, it appears that both the Stencil Design 
Test and the Healy Picture Completion Test II may be standardized at too 
high a level. Any such differences in test standardization or deficiencies in 
normative data will, of course, further complicate diagnostic analyses of 
relative performance. 


NON-LANGUAGE GROUP TESTS 


The first non-language group test was the Army Examination Beta (84, 
85), developed for testing foreign-speaking and illiterate soldiers in the 
Army during World War I. The Beta was given to all men who fell below 
a certain score on the Alpha. In this group were included not only those 
who were handicapped by foreign-language background or illiteracy, but 
also those who performed poorly on Alpha for any other reason. Since the 
Beta had a lower test floor than the Alpha, it discriminated better than 
Alpha at the lower levels. 

Instructions for the Army Beta tests were administered by means of ges- 
ture, pantomime, and demonstrations on specially prepared blackboard 
charts. The examiner’s task was much more difficult in the Beta than in the 
Alpha. In addition, the procedure required the services of a trained demon- 
strator who did before the group what the subjects were later required to do 
in the test booklets. The subjects responded by drawing lines or making 
simple marks, except in one test which required the writing of numbers. 

In the construction of the Beta, an effort was made to pattern it as closely 
as possible after the Alpha, since it was designed as a substitute for the 
Alpha. Subtests for the Beta were selected principally on the basis of their 
correlations with Alpha and with total Beta scores. The Army Beta placed 
considerable emphasis on speed. Most of its subtests seem to measure chiefly 
spatial orientation or perceptual speed and accuracy. The Beta correlated 
approximately .80 with Alpha and .73 with the Stanford-Binet. It should be 
noted, however, that these correlations were found on groups of enlisted 
men representing a wide range of ability. Within more homogeneous samples, 
the correlations between Beta and verbal tests would undoubtedly be lower. 
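
The expected drop in correlation within a more homogeneous group can be 
sketched numerically. The example assumes direct selection on one of the 
two variables, a linear relation, and equal scatter about the regression 
line at all ability levels; under those assumptions the restricted 
correlation is r * u / sqrt(1 - r^2 + r^2 * u^2), where u is the ratio of 
the restricted to the unrestricted SD. The .80 figure is the Beta-Alpha 
correlation cited above; the SD ratio of .5 is purely hypothetical. 

```python
import math


def restricted_r(r_full: float, sd_ratio: float) -> float:
    """Correlation expected after direct selection on one variable.

    r_full   -- correlation in the unrestricted (heterogeneous) group
    sd_ratio -- SD of the selected group divided by SD of the full group
    """
    if not 0.0 < sd_ratio <= 1.0:
        raise ValueError("sd_ratio must lie in (0, 1]")
    numerator = r_full * sd_ratio
    denominator = math.sqrt(1.0 - r_full**2 + (r_full**2) * (sd_ratio**2))
    return numerator / denominator


# Beta-Alpha correlation of .80 in the full range; hypothetical group whose
# SD on the selection variable is half that of the full group.
print(round(restricted_r(0.80, 0.5), 2))  # 0.55
```
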

As with the Army Alpha, several civilian revisions of the Army Beta were de- 
veloped. A current form is the 1946 restandardization of the Revised Beta 
Examination (48). This form consists of six subtests, including (a) mazes, 
(b) symbol-digit substitution, (c) pictorial absurdities, (d) paper form- 
board, (e) picture completion, and (f) perceptual speed. Both administra- 
tion and scoring were considerably simplified in this version. Some language 
is used in giving the instructions, although the explanations rely principally 
on the practice exercises that precede each subtest. Total scores are ex- 
pressed as deviation IQ's. One of the chief uses of the Revised Beta is to 
be found in mass industries employing many persons with foreign back- 
ground or with little education. It is also sometimes administered in penal 
institutions as a supplement to verbal group scales. 

A well-known non-language scale, designed for elementary school chil- 
dren, is the Pintner Non-Language Test (66). This test was originally con- 
structed for use with deaf children. It was principally for this purpose, too, 
that the previously described Pintner-Paterson Performance Scale was first 
developed. In his pioneer studies of deaf children, Pintner found that the 
linguistic retardation of such children is so great that they cannot be properly 
tested with any verbal instrument. The performance and non-language scales 
were thus developed for individual and group testing of deaf children, re- 
spectively. Both instruments have subsequently been employed with many 
other types of subjects. 


In its current revision, the Pintner Non-Language Test is available in two 
equivalent forms, K and L. This test is suitable for grades 4 to 9, and parallels 
the Intermediate form of the Pintner General Ability Test, Verbal Series, 
discussed in Chapter 9. Since it was standardized on a population of known 
ability in terms of the Pintner Verbal Test norms, the scores on the two 
tests are directly comparable. The Pintner Non-Language Test is ordinarily 
administered with simple oral instructions, but pantomime directions are also 
available for use with deaf and foreign-speaking children. All items are of 
the multiple-choice type. Each form consists of six subtests, which are il- 
lustrated in Figure 46. All subtests are timed, but the time limits are de- 
scribed as fairly generous. Norms were established on over six thousand 
school children tested in different parts of the country. Total scores are ex- 
pressed as deviation IQ's with an SD of 16. Mental age and percentile 
norms are also provided. Split-half reliabilities of the two forms on a single- 
year group were found to be .86 and .89. 

Subtests for the Pintner Non-Language Test were selected on the basis of 
their parallel-form reliability, correlation with total score, and correlation 
with the Intermediate Test of the Pintner Verbal series. Within each subtest, 
items were chosen in terms of grade increment in percentage passing. The 
criteria employed were thus internal consistency, correlation with a verbal 
test, and grade differentiation. For homogeneous age groups, correlations 
in the .60’s are reported between the Non-Language and the corresponding 
Verbal Test of the Pintner series. These correlations are high enough to 
indicate considerable overlap, but low enough to justify the use of the Non- 
Language Test as a supplement to the Verbal Test. Such correlations should 
also be borne in mind, however, when the test is employed as a substitute 
for the verbal tests with deaf and other specially handicapped groups. It 
must be recognized that somewhat different abilities are being tapped by 
the two types of tests. By their very nature, verbal and non-verbal tests can- 
not be regarded as completely interchangeable. 


1. Figure Dividing: Choose the line or lines which show how the figure at the left 
can be cut up to look like the pieces at the right. 


2. Reverse Drawings: The two drawings at the left are alike, but the second has 
been turned over and one line is missing from it. Find the missing line 
among those given. 


3. Pattern Synthesis: If the two drawings at the left are superimposed, which of 
the four given drawings will the result look like? 


4. Movement Sequence: The figure at the left has been turning in the direction 
indicated by the three successive positions. Find the correct fourth position 
from those given. 


5. Manikin: Find the man who is holding up his arms like the first one. 


6. Paper Folding: The drawings at the left show a piece of paper which has been 
folded twice and a piece has been cut out of it. Find the drawing which shows 
how the paper would look if it were unfolded. 


Fig. 46. Sample Items from the Pintner Non-Language Test. (Copyright by World 
Book Company.) 

An important question to consider regarding non-language tests concerns 
the extent to which they depend upon spatial and perceptual functions, as 
contrasted to the symbolic manipulation of abstract relations, concepts, and 
factual information. The latter functions would seem to resemble more 
closely those required in the traditional verbal tests of "intelligence." To be 
sure, the substitution of pictorial for verbal or numerical content may ap- 
preciably alter the nature of the test. At the same time, all pictorial or non- 
language tests cannot be indiscriminately grouped together. Some, like the 
current form of the Pintner Non-Language Test, stress spatial and perceptual 
factors almost to the exclusion of other functions. Such tests will be seen to 
resemble quite closely some of the special aptitude tests to be considered in 
Part 3. Other non-language tests employ a greater proportion of items calling 
for ideational or symbolic responses. Examples of the latter type include the 
Chicago Non-Verbal Examination (17) and the SRA Non-Verbal Form 
(58). Figure 47 shows sample items from the SRA test. Both tests need 
further standardization and other technical improvements. In their present 
form they are of interest chiefly because they illustrate attempts to measure 
conceptual abilities without the use of verbal or numerical content. 


In each row, put an X in the box of the picture that is MOST DIFFERENT. 


Fig. 47. Sample Items from the SRA Non-Verbal Form. (McMurry and King, 58. 
Copyright by Science Research Associates.) 


A similar approach is illustrated by the recently developed Tests of Gen- 
eral Ability, or TOGA (29). These tests consist of a series of five over- 
lapping batteries, extending from kindergarten to grade 12. In some respects 
they resemble the tests for primary grades discussed in Chapter 9, except that 
they utilize similar pictorial and diagrammatic material at all grade levels. 
They are not truly non-language, since extensive use of oral language is 
made throughout. Each battery requires a total of 35 to 45 minutes of testing 
time and consists of two parts. Part I uses pictures to test general information 
and the understanding of concepts learned in school, home, or community. 
Part II is a classification test employing geometric forms. This part is de- 
signed to test abstract reasoning with material that is relatively free from 
specific cultural content. Sample items of both types from the highest level 
(grades 9 to 12) are illustrated in Figure 48. More information would be 
desirable on just what is measured by the items in Part I, especially at the 
upper levels. Some require the application of concrete factual knowledge or 
of a principle that is described in fairly simple terms. Many of the items, 
however, depend upon the understanding of a key word, such as “resiliency” 
or "pachyderm." One wonders, incidentally, why an individual who has 
mastered such vocabulary needs a non-reading test. Illiteracy and reading 
disabilities are unlikely to be associated with such knowledge. 


PART I. Find the one that uses dry cells. (Answer: D) 
PART II. In each row, mark the one that does not follow the same rule as the 
other four. (Answers: D, B) 

Fig. 48. Practice Items from Tests of General Ability (TOGA), Grades 9 to 12. 
(Reproduced by permission of John C. Flanagan.) 


Because of the recency of their publication, the TOGA batteries cannot 
be adequately evaluated. Available data on reliability and concurrent valid- 
ity, however, appear favorable. In single grade levels, split-half reliabilities 
of total scores range from .80 to .90, correlations with general achievement 
tests range from .52 to .81, and correlations with school grades range from 
.38 to .50. A correlation of .65 is reported with the Stanford-Binet, and cor- 
relations from .41 to .80 were found with a number of predominantly verbal 
group tests. Two types of norms are provided, in terms of grade level and 
ratio IQ. No information is given to show that the variability of the IQ re- 
mains constant with age. Although major emphasis is placed on total scores, 
some normative data are also included that permit a comparison of an in- 
dividual's scores on Parts I and II and an evaluation of the significance of 
the difference between the two part scores. Such comparisons are recom- 
mended for children with unusual cultural backgrounds. 


“CROSS-CULTURAL” TESTING 


One of the purposes for which non-language tests were designed was the 
testing of individuals reared in different cultures or subcultures. It is apparent, 
however, that persons from certain cultures would still be handicapped on 
most non-language tests primarily because of specific information that the 
tests presuppose. The picture-completion test of the Army Beta, for example, 
requires familiarity with such articles as a violin, postage stamp, gun, and 
pocketknife. Similarly, the Chicago Non-Verbal Examination includes pic- 
tures of cooking utensils, tools, a telephone, a radio, a piano player, tele- 
graph poles, a basketball game, and many other culturally linked objects or 
activities. Some of the pictures employed in such performance scales as the 
Pintner-Paterson and the Arthur likewise require knowledge that is specific 
to our culture. In so far as speed influences scores on any of these tests, 
moreover, individuals from certain cultures or subcultures would have a 
decided advantage. 

To reduce the cultural restrictions characterizing most non-language and 
performance tests, several attempts have been made to construct "culture- 
free" tests. In order to evaluate such tests properly, a number of points 
should be noted at the outset. In the first place, no test can be truly "cul- 
ture-free." Since every test measures a sample of behavior, it will reflect any 
factor that influences behavior. Persons do not react in a cultural vacuum. 
There is a mass of evidence to indicate the basic role that cultural factors 
play in behavior development (cf. 5, Ch. 18). It is, however, theoretically 
possible to construct a test that presupposes only experiences that are com- 
mon to different cultures. Such a test would not be free from cultural in- 
fluences, but would utilize only elements common to many cultures. For this 
reason, the term “cross-cultural” has been employed here to characterize 
these tests, in preference to the more common but misleading term “culture- 
free.” 

It should also be noted, however, that no existing test is universally ap- 
plicable or entirely unrestricted in its cultural reference. The difference is 
one of degree, cross-cultural tests being less restricted than others. Any test 
tends to favor individuals from the culture in which it was developed. The 
mere use of paper and pencil or the presentation of abstract tasks having 
no immediate practical significance will favor some cultural groups and 
handicap others. Emotional and motivational factors likewise influence test 
performance. Among the many relevant conditions differing from culture to 
culture may be mentioned the intrinsic interest of the test content, rapport 
with the examiner, drive to do well on a test, desire to excel others, and past 
habits of solving problems individually or cooperatively (cf. 5, pp. 561- 
568). 


Each culture encourages and fosters certain abilities and ways of behav- 
ing, and discourages or suppresses others. It is therefore to be expected that, 
on tests developed within the American culture, American subjects will 
generally excel. If a test were constructed by the same procedures within a 
culture differing markedly from ours, American subjects would probably 
appear deficient in terms of test norms. Data bearing on this type of cultural 
comparison are very meager. What evidence is available, however, suggests 
that persons from our culture may be just as handicapped on tests prepared 
within other cultures as members of those cultures are on our tests (cf. 5, 
pp. 566-568). It is interesting to observe that the same relationship holds in 
the case of urban and rural subcultures. When two forms of a test, developed 
on urban and rural groups, respectively, were administered to new samples 
of urban and rural children, the urban groups excelled on the urban form 
and the rural groups on the rural form (5, p. 533; 78). 

Three distinct approaches to the testing of persons in different cultures 
should be recognized. Each differs in its fundamental objectives. First, differ- 
ent tests may be developed within each culture and validated against local 
criteria only. This approach is exemplified by the many revisions of the 
original Binet scales for use in different European, Asiatic, and African 
groups, as well as by a few tests whose development was initiated within 
particular cultural groups (e.g., 59, 62, 83). In such instances, the tests are 
validated against the specific criteria they are designed to predict, and per- 
formance is evaluated in terms of local norms. Each test is applied only 
within the culture in which it was developed and no cross-cultural compari- 
sons are attempted. It might be added that, in comparison with the mass of 
tests available in America, the number developed for use in other national 
or cultural groups is small. Psychological testing is a predominantly American 
movement. Moreover, even when not constructed by American psychologists, 
the tests prepared in other countries frequently follow the pattern set by 
American tests. 


A second major approach is to make up a test within one culture and 
administer it to individuals with different cultural backgrounds. Such a pro- 
cedure would be followed when the object of testing is prediction of a local 
criterion within a particular culture. In such a case, if the specific cultural 
loading of the test is reduced, the test validity may also drop, since the 
criterion itself is culturally loaded (cf. 4). For example, an individual who 
scores poorly on a verbal test will probably also be handicapped in school 
work or in jobs calling for verbal ability within that particular culture. On 
the other hand, we should avoid the mistake of regarding any test de- 
veloped within a single cultural framework as a universal yardstick for 
measuring “intelligence.” Nor should we assume that a low score on such a 
test has the same explanation when obtained by a member of another cul- 
ture as when obtained by a member of the test culture. 

The third approach to the testing of different cultural groups involves the 
choice of items common to many cultures and the validation of the resulting 
test against local criteria in many different cultures. This is the basic ap- 
proach of the cross-cultural tests, although the repeated validation in differ- 
ent cultures has often been either neglected altogether or inadequately exe- 
cuted. Without such a step, however, we cannot be sure that the test is 
relatively free from culturally restricted elements. Moreover, it is possible 
that a test constructed entirely from elements that are equally familiar in 
many cultures might measure trivial functions and possess little validity in 
terms of essential criteria in any culture. For both reasons, therefore, validity 
must be rechecked in terms of criteria considered important within each 
culture. 


Fig. 49. Typical Materials for Use in the Leiter International Performance Scale. 
The test illustrated is the Analogies Progression Test from the Six-Year Level. 
(Courtesy C. H. Stoelting Company.) 


An example of a cross-cultural test is provided by the Leiter International 
Performance Scale (54, 55). This series of tests was developed through 
several years of use with different ethnic groups in Hawaii, including elemen- 
tary and high school pupils. It was subsequently applied to several African 
groups by Porteus and to a few other national groups by other investigators. 
A later revision, issued in 1948, was based upon further testing of American 
children, high school students, and Army recruits during World War II. A 
distinctive feature of the Leiter scale is the almost complete elimination of 
instructions, either spoken or pantomime. Each test begins with a very easy 
task of the type to be encountered throughout that test. The comprehension 
of the task is treated as part of the test. The materials consist of a response 
frame, shown in Figure 49, with an adjustable card holder. All tests are ad- 
ministered by attaching the appropriate card, containing printed pictures, to 
the frame. The subject chooses the blocks with the proper response pictures 
and inserts them into the frame. 

The Leiter scale was designed to cover a wide range of functions, similar 
to those found in verbal scales. Among the tasks included may be mentioned: 
matching identical colors, shades of gray, forms, or pictures; copying a block 
design; picture completion; number estimation; analogies; series completion; 
recognition of age differences; spatial relations; footprint recognition; simi- 
larities; memory for a series; and classification of animals according to habi- 
tat. These tests are arranged into year levels from 2 to 18.¹ They are ad- 
ministered individually with no time limit. The scale is scored in terms of 
MA and ratio IQ, although there is no assurance that such an IQ retains 
the same meaning at different ages. In fact, the published data show con- 
siderable fluctuation in the standard deviation of the IQ's at different age 
levels. Split-half reliabilities of .91 to .94 are reported from several studies, 
but the samples varied widely in age and probably in other characteristics. 
Validation data are based principally on age differentiation and internal 
consistency. Some correlations are also reported with teachers’ ratings of 
intelligence and with scores on other tests, chiefly the Stanford-Binet. The 
latter correlations range from .64 to .81 but were obtained on rather hetero- 
geneous groups. 
The IPAT Culture Free Intelligence Test (22) was developed by Cattell 
at the Institute for Personality and Ability Testing, University of Illinois. 
This test is now available in three levels: Scale 1, for ages 4 to 8 and 
feebleminded adults; Scale 2, for ages 8 to 12 and unselected adults; and 
Scale 3, covering a range from high school pupils to superior adults. Each 
scale has been prepared in two parallel forms, A and B. Scale 1 requires 
individual administration for at least some of the tests; the other scales may 
be given either as individual or as group tests. Scale 1 comprises eight tests, 
only four of which are described by the author as “culture-free.” The other 
four involve both verbal comprehension and specific cultural information. It 
is suggested that the four “culture-free” tests can be used as a sub-battery, 
separate norms being provided for this abbreviated scale. Scales 2 and 3 
are alike, except for difficulty level. Each consists of the following four tests, 
sample items from which are shown in Figure 50. 


1. Series: Select the item that completes the series. 


2. Classification: Mark the one item in each row that does not belong with the 
others. 


3. Matrices: Mark the item that correctly completes the given matrix, or pattern. 


4. Conditions: Insert a dot in one of the alternative designs so as to meet the 
same conditions indicated in the sample design. Thus in the example repro- 
duced in Figure 50, the dot must be in the two rectangles, but not in the circle. 
This condition can be met only in the third response alternative, which has 
been marked. 


Fig. 50. Sample Items from IPAT Culture Free Test of Intelligence, Scale 2: 
Test 1, Series; Test 2, Classification; Test 3, Matrices; Test 4, Conditions. 
(Copyright by Institute for Personality and Ability Testing.) 


¹ The tests in year levels 2 to 12 are also available as the Arthur Adaptation 
of the Leiter International Performance Scale, standardized as a point scale by 
Dr. Grace Arthur (11). This scale is for testing children between the ages of 
4 and 8 years. 


For Scale 1, only ratio IQ's are provided. In Scales 2 and 3, scores can
be converted into deviation IQ's with SD's of either 24 or 16 points. The
latter conversion was added later in order to insure more comparability with
IQ's obtained on other familiar tests. In interpreting IQ's from the IPAT
test, it is thus important to note which conversion was used. Scale 2 has been
standardized on larger samples than either of the other two scales, but the
representativeness of the samples and the number of cases at some age
levels still fall short of desirable test-construction standards (21). Although
the tests are highly speeded, some norms are provided for an untimed version.
Fairly extensive verbal instructions are required, but the author asserts
that giving these instructions in a foreign language or in pantomime
will not affect the difficulty of the test.

Reliability and validity data appear to have been gathered largely on
Scale 2, some having been obtained with an earlier, longer form of the
test (20, 23). Split-half reliabilities between .70 and .92 are reported for
Forms A and B combined. Results changed little when the test was given
under unspeeded conditions. However, no information is given on the nature
of the groups on which these coefficients were found. Immediate retests
yielded reliabilities in the .80's, but retests over a longer interval in one
sample correlated as low as .53. Validity is discussed chiefly in terms of
saturation with Spearman's general ability factor (g). Factorial validity of
the IPAT test was accordingly determined from its correlations with a pool
of intelligence tests, including both verbal and performance types. Data on
concurrent and predictive validity in terms of non-test criteria are virtually
non-existent. The IPAT tests have been administered in several European
countries, in America, and in certain African and Asiatic cultures. Norms
tended to remain unchanged in cultures moderately similar to that in which
the tests were developed; in other cultures, however, performance fell
considerably below the original norms.
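
The practical effect of the two IPAT conversions can be illustrated with a
short sketch. It assumes that both deviation-IQ scales are centered on a mean
of 100, which is customary but is not stated above for the IPAT tests; the
function names are hypothetical and are introduced only for illustration.

```python
def deviation_iq(z_score, sd=16.0, mean=100.0):
    """Convert a standard (z) score into a deviation IQ.

    The mean of 100 is an assumption made for this illustration.
    """
    return mean + sd * z_score


def rescale_iq(iq, from_sd, to_sd, mean=100.0):
    """Re-express a deviation IQ on a scale with a different SD."""
    z_score = (iq - mean) / from_sd
    return deviation_iq(z_score, sd=to_sd, mean=mean)


# A subject one SD above the mean obtains different IQ's on the two scales.
iq_on_sd24 = deviation_iq(1.0, sd=24.0)
iq_on_sd16 = rescale_iq(iq_on_sd24, from_sd=24.0, to_sd=16.0)
print(iq_on_sd24, iq_on_sd16)  # 124.0 116.0
```

The same relative standing thus yields an IQ of 124 on one scale and 116 on
the other, which is why the conversion used must be known before any such IQ
is compared with IQ's from other tests.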

The Progressive Matrices (71, 72, 73), developed in Great Britain by
Raven, were also designed as a measure of Spearman's g factor. Requiring
chiefly the eduction of relations among abstract items, this test is regarded
by most British psychologists as the best available measure of g. It consists
of 60 matrices, or designs, from each of which a part has been removed.
The subject chooses the missing insert from six or eight given alternatives.
The items are grouped into five series, each containing 12 matrices of
increasing difficulty but similar in principle. The earlier series require
accuracy of discrimination; the later, more difficult series involve analogies,
permutation and alteration of pattern, and other logical relations. Two sample
items are reproduced in Figure 51. The test is administered with no time
limit, and can be given individually or in groups. Very simple oral
instructions are required.

Percentile norms are provided for each half-year interval between 8 and
14 years, and for each five-year interval between 20 and 65 years. These
norms are based on British samples, including 1407 children, 3665 men in
military service tested during World War II, and 2192 civilian adults. Closely
similar norms were obtained by Rimoldi (75) on 1680 children in Argentina.
Fig. 51. Sample Items from the Progressive Matrices. (Reproduced by permission of
J. C. Raven.) 


Use of the test in several European countries likewise indicated the ap- 
plicability of available norms. Studies in a number of non-European cultures, 
however, have raised doubts about the suitability of this test for groups with 
very dissimilar backgrounds (13, 19, 45, 52, 53, 63, 64). In such groups,
moreover, the test was found to reflect amount of education and to be 
susceptible to considerable practice effect. 

The manual for the Progressive Matrices (72) is quite inadequate, giving 
little information on reliability and none on validity. Many investigations 
have been published, however, that provide relevant data on this test. In a
review of publications appearing prior to 1957, Burke (19) lists over 50 
studies appearing in England, 14 in America, and 10 elsewhere. Since that 
time, research has continued at a rapid pace, especially in America where this 
test has received growing recognition. 

Retest reliability in groups of older children and adults that were
moderately homogeneous in age varies approximately between .70 and .90. In
the lower score ranges, however, reliability falls considerably below these
values. Correlations with both verbal and performance tests of intelligence
range between .40 and .75, tending to be higher with performance than
with verbal tests. Studies with mental defectives and with different
occupational and educational groups indicate fair concurrent validity.
Predictive validity coefficients against academic criteria run somewhat lower
than those of the usual verbal intelligence tests. Several factorial analyses
suggest that the Progressive Matrices are heavily loaded with a factor common
to most intelligence tests (identified with Spearman's g by British
psychologists) but that spatial aptitude, inductive reasoning, perceptual
accuracy, and other group factors also influence performance (19).
An easier, colored form of the Progressive Matrices has been prepared for
use with children between the ages of 5 and 11 and with feebleminded
adults (73). At this level, the test is available in both book form and board
form, the latter requiring the subject to insert the right piece rather than
choose the correct completion. The book form, however, is more readily
obtainable and is described as adequate for most testing purposes. Another
form, with a higher ceiling, has been specially developed for testing superior
groups (30, 71), but its distribution is restricted to approved and registered
users.

On the whole, the Progressive Matrices show considerable promise for a
variety of testing purposes, but more systematic data are needed on norms,
reliability at different levels, and validity. It should be noted that this test,
as well as other so-called culture-free tests, is applicable to most situations
for which non-language tests were devised. In comparison with non-language
tests, the culture-free tests have the advantage of requiring less culturally 
restricted information. A further advantage is that current culture-free tests 
depend more heavily upon abstract reasoning and less on spatial aptitudes 
than is true of available well-standardized non-language tests. 

Another example of a completely non-language cross-cultural test is the 
Semantic Test of Intelligence (STI) constructed at Harvard University under
contract with the Army (76, 77). Designed primarily as a group test
and utilizing only pantomime and demonstration in the instructions, STI was
developed to identify illiterates in military service who can profit from
literacy training. This test requires the learning of a new set of semantic
symbols, which are then combined into short “sentences.” Figure 52 shows two
sample items employed at the two-symbol stage. By reference to the key at
the top of the page, it will be seen that the first symbol stands for “cow,”
the second for “jumping,” the third for “woman,” and the fourth for “lying
down.” The first sample item contains the symbols for “cow” and “jumping.”
Hence the correct picture is that of the jumping cow, and this picture has
been encircled. The second item has the symbols for “woman” and “lying
down.” Accordingly the woman lying down has been encircled.

The highest semantic level reached by the test is the four-symbol
sentence, such as “woman kicks lying-down dog.” In the more difficult
items, the subject must abstract the common feature associated with the
particular symbol and apply it to objects unlike those in the key. A
parallel-form reliability coefficient of .855 was found in a group with
narrowly restricted ability range. In follow-up studies of 314 Marine Corps
recruits in slow-learning classes, STI predicted subsequent performance on
subject-matter examinations and reading instructors' grades better than these
could be predicted from either General Classification Test or Army Beta
scores.




Fig. 52. Sample Items from the Semantic Test of Intelligence. (Reproduced by per- 
mission of P. J. Rulon.) 


A somewhat different approach is illustrated by the Goodenough Draw-a-Man
Test (32), in which the subject is simply instructed to “make a picture
of a man; make the very best picture that you can.” This test was in use
without change from its original standardization in 1926 until 1961. An
extension and revision was published in 1961 under the title of
Harris-Goodenough Test of Psychological Maturity (37, 38). In the revision,
as in the original test, emphasis is placed upon the child's accuracy of
observation and upon the development of conceptual thinking, rather than upon
artistic skill. Credit is given for the inclusion of individual body parts,
clothing details, proportion, perspective, and similar features. A total of
73 scorable items were selected on the basis of age differentiation, relation
to total score on the test, and relation to group intelligence test score.
Data for this purpose were obtained by testing samples of 50 boys and 50 girls
at each grade level from kindergarten to the ninth grade in urban and rural
areas of Minnesota and Wisconsin, stratified according to father's occupation.

In the revised scale, subjects are also asked to draw a picture of a woman
and of themselves. The Woman Scale is scored in terms of 71 items similar
to those in the Man Scale. The Self Scale has been developed as a projective
test of personality, although available findings from this application are not
promising. Norms on both Man and Woman Scales were established on new
samples of 300 children at each year of age from 5 to 15, selected so as to be
representative of the U.S. population with regard to father's occupation and
geographical region. Point scores on each scale are transmuted into deviation
IQ's with a mean of 100 and an SD of 15. In Figure 53 will be found
three illustrative drawings produced by children aged 5-8, 8-8, and 12-11,
together with the corresponding raw point scores and deviation IQ's.


Man: Raw Score 7, CA 5-8, IQ 73
Woman: Raw Score 31, CA 8-8, IQ 103
Man: Raw Score 66, CA 12-11, IQ 134

Fig. 53. Specimen Drawings Obtained in Harris-Goodenough Test of Psychological
Maturity. (Courtesy Dale B. Harris.)

The reliability of the Draw-a-Man Test has been repeatedly investigated
by a variety of procedures. In one carefully controlled study of the earlier
form administered to 386 third- and fourth-grade school children, the retest
correlation after a one-week interval was .68, and split-half reliability was
.89 (56). Rescoring of the identical drawings by a different scorer yielded a
scorer reliability of .90, and rescoring by the same scorer correlated .94.
Studies with the new form (38) have yielded similar results. Readministration
of the test to groups of kindergarten children on consecutive days
revealed no significant difference in performance on different days. Examiner
effect was also found to be negligible, as was the effect of art training in
school (38). The old and new scales are apparently quite similar, their
scores correlating between .91 and .98 in homogeneous age groups. The
correlation of the Man and Woman Scales is about as high as the split-half
reliability of the Man Scale found in comparable samples. On this basis,
Harris recommends that the two scales be regarded as alternate forms and
that the mean of their deviation IQ's be used for greater reliability.
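
The gain in reliability expected from averaging two alternate forms can be
sketched with the Spearman-Brown formula. This formula is a standard
psychometric result introduced here for illustration; it is not cited above
as Harris' own procedure. The sketch assumes that the Man and Woman Scales
behave as parallel forms, and the correlation of .80 and the two IQ's are
hypothetical values.

```python
def mean_deviation_iq(man_iq, woman_iq):
    """Average the deviation IQ's obtained on the two scales."""
    return (man_iq + woman_iq) / 2.0


def spearman_brown(r, k=2.0):
    """Estimated reliability of a composite of k parallel forms,
    given the correlation r between two single forms."""
    return k * r / (1.0 + (k - 1.0) * r)


# Hypothetical values, chosen only for illustration.
combined_iq = mean_deviation_iq(103, 97)
combined_reliability = spearman_brown(0.80)
print(combined_iq, round(combined_reliability, 2))  # 100.0 0.89
```

Under these assumptions a correlation of .80 between the two scales would
correspond to a reliability of about .89 for their mean.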

Apart from the item-analysis data gathered in the development of the
scales, information regarding the construct validity of the test is provided by
correlations with other tests. For the earlier form, correlations between .41
and .80 have been reported with other intelligence tests, principally the
Stanford-Binet (cf. 6). In a study with 100 fourth-grade children,
correlations were found between Draw-a-Man IQ and scores on a number of tests
of known factorial composition (6). Such correlations indicated that, within
the ages covered, the Draw-a-Man Test correlates highest with tests of
reasoning, spatial aptitude, and perceptual accuracy. Motor coordination plays
a negligible role in the tests at these ages. For kindergarten children, the
Draw-a-Man IQ correlated higher with numerical aptitude and lower with
perceptual speed and accuracy than it did for fourth-grade children (38).
Such findings suggest that the test may measure somewhat different functions
at different ages.

The original Draw-a-Man Test has been administered widely in clinics as
a supplement to the Stanford-Binet and other verbal scales. It has also been
employed in a large number of studies on different cultural and ethnic
groups, including several American Indian samples. Such investigations have
indicated that performance on this test is more dependent upon differences
in cultural background than was originally assumed by its author (cf. 33).
In a review of studies pertaining to this test, Goodenough and Harris
expressed the opinion that “the search for a culture-free test, whether of
intelligence, artistic ability, personal-social characteristics, or any other
measurable trait is illusory” (33, p. 399). This view was reaffirmed by Harris
in the 1961 book (38, Ch. 3). Certainly any test requiring the use of paper
and pencil and involving representational drawing may be expected to show
significant cultural differences.


tors at the University of Chicago, under 
Havighurst, compared the performance o 


en from current intelligence tests 


(28). Such studies showed wide variation in “cultural differentials” from
item to item. The role of such factors as practice, motivation, testing
conditions, and form of items has also been investigated in relation to
socioeconomic level of subjects (35).

As a result of these studies, a test designed to be relatively free from
“social-class bias” was developed. Known as the Davis-Eells Games (26),
this test is applicable from the first to the sixth school grade. The test
requires no reading, all instructions being given orally by the examiner. The
content is entirely pictorial and consists of problems chosen from the
everyday-life experiences of children in the urban American culture. All parts
of the test are presented as games, the administration being designed to induce
a comfortable and relaxed atmosphere. Praise and encouragement are also
freely given to increase motivation. Several of the items represent humorous
situations, introduced as a further appeal to the interests of children. The
role of speed is reduced to a minimum. The subtests include Verbal Problems,
Money Problems, Best Ways Problems, and Analogies. These subtests
appear to have been designed to measure primarily verbal comprehension,
number ability, spatial visualization and mechanical comprehension, and
reasoning, respectively. Two sample items will be found in Figure 54.

The Davis-Eells Games were standardized on a sample of 19,756 school
children in 17 cities and 2 counties, situated in 15 states. The subjects were
selected so as to constitute a representative sample of the urban American
Population with respect to geographical distribution, size of community, Sone: 
background, and parental occupation. Scores are expressed as deviation 
IQ's with a mean of 100 and an SD of 16. The authors recommend, however,
that the score on this test be described, not as an IQ, but as an Index
of Problem-Solving Ability (IPSA), the latter being a more specific
designation of what the test undertakes to measure. Split-half reliability
coefficients were found to be in the .80's from grade 2 to grade 6. In grade 1,
however, in which a shorter form of the test is employed, the reliability
coefficient was only .68. Retests within a two-week interval yielded
coefficients of .72 and

Ü in grades 2 and 4, respectively- But these coefficients may have been 
somewhat inflated by the subjects' recall of their previous responses.

Although reporting a number of moderately high correlations with other
group intelligence tests and with educational achievement tests, the authors
justify the test primarily in terms of its content validity. They point out, in
fact, that this test is not intended to be a measure of scholastic aptitude, and
hence should not correlate too highly with either educational achievement
tests or with other currently available intelligence tests. In some thirty studies
which have appeared since its publication in 1953, this test has not fared
well. Predictive and concurrent validity coefficients against achievement tests


Verbal Problems: Examiner reads three statements about each picture, instructing 
the subject to mark the one which is true. 


Problem A

1. They are waving at a boy.
2. They are waving at a girl.
3. We cannot tell from this picture whom they are waving to.

Problem B

1. The man fell down and hit his head.
2. A ball came through the window and hit the man's head.
3. The picture does not show how the man got the bump on his head. Nobody
   can tell because the picture doesn't show how the man got the bump.


“Best Ways” Problems: Which boy is starting to load the packages the best way 
so he can take all three home? 


Fig. 54. Sample Items from the Davis-Eells Games: A Test of General Intelligence 
(Copyright by World Book Company.) 


and teachers' ratings have almost always been lower for the Davis-Eells than
for conventional intelligence tests. To be sure, the test authors would not
regard this as a serious drawback. A more telling criticism, however, stems
from the common finding that lower-class children perform as poorly on the
test as on other intelligence tests. Thus the test seems to have sacrificed
predictive validity without eliminating “cultural bias.”
TESTING THE PHYSICALLY HANDICAPPED


The growing emphasis upon rehabilitation and training of the physically
handicapped has created an increasing demand for appropriate testing
instruments. The testing of deaf children has already been noted as the primary
object in the development of the original Pintner-Paterson Performance
Scale, as well as the Pintner Non-Language Test. Similarly, deaf children
represent one of the special groups for which the Arthur Performance Scale
was prepared. In the Revised Form II of this scale, the verbal instructions
required in Form I were further reduced in order to increase the applicability
of the test to deaf children (10). Many other tests that have been discussed
in this chapter can be used with the deaf.

Although adapted to the testing of the deaf, all of these tests have been
standardized primarily on hearing subjects. For many purposes it is of course
desirable to compare the performance of the deaf with general norms
established on hearing persons. At the same time, norms obtained on deaf
children are also useful in a number of situations pertaining to the
educational development of such children.


To meet this need, the Nebraska Test of Learning Aptitude, developed
by Hiskey (43), was standardized on deaf and hard-of-hearing children.
This is an individual test suitable for ages 4 to 10. Speed was eliminated,
since it is difficult to convey the idea of speed to young deaf children. An
attempt was also made to sample a wider variety of intellectual functions
than those covered by most performance tests. Pantomime and practice
exercises to put across the instructions, as well as intrinsically interesting
items to establish rapport, were considered important requirements for such a
test. All items were chosen with special reference to the limitations of deaf
children, the final item selection being based chiefly upon the criterion of
age differentiation.

The Nebraska Test consists of eleven subtests, as follows:

 1. Memory for Colored Objects
 2. Bead Stringing
 3. Pictorial Associations
 4. Block Building
 5. Memory for Digits
 6. Completion of Drawings
 7. Pictorial Identification
 8. Paper Folding
 9. Visual Attention Span
10. Puzzle Blocks
11. Pictorial Analogies


Norms are based on 466 children, aged 4 to 10 years, who were attending
residential state schools for the deaf in six states. Norms for hearing children
were added later. A split-half reliability of .96 is reported. Correlations of
subtests with total score range from .63 to .84. A correlation of .829 was
found between Nebraska and Stanford-Binet tests on 380 hearing children.
The manual contains a general discussion of desirable practices in testing
deaf children.

Testing the blind presents a very different set of problems from those en- 
countered with deaf subjects. Oral tests can be most readily adapted for 
blind subjects, while performance tests are least likely to be applicable. A 
good introduction to procedures for testing the blind, together with a
summary of the principal intelligence, special aptitude, achievement, and
personality tests prepared for this purpose, may be found in a manual prepared
by Bauman and Hayes (12) and in a survey by Rawls (74).

In addition to the usual oral presentation by the examiner, other suitable
testing techniques have been utilized, such as phonograph records and tape
or wire recordings. Some tests are also available in braille. The latter
technique is somewhat limited in its applicability, however, by the greater
bulkiness of materials printed in braille as compared with inkprint, by the
slower reading rate for braille, and by the number of blind persons who are not
facile braille readers. The subject's responses may likewise be recorded in
braille or on a typewriter. Specially prepared embossed answer sheets or
cards are also available, especially for use with true-false, multiple-choice,
and other objective-type items. In many individually administered tests, of
course, oral responses can be obtained.


Among the principal examples of general intelligence tests that have been
adapted for blind subjects are the Binet and the Wechsler. The first
Hayes-Binet revision for testing the blind was based on the 1916 Stanford-Binet.
In 1942, the Interim Hayes-Binet ² was prepared from the 1937 Stanford-Binet
(39, 40). All items that could be administered without the use of
vision were selected from both Form L and Form M. This procedure yielded
six tests for each year level from VII to XIV, and eight tests at the Average
Adult level. In order to assemble enough tests for year levels III to VI,
it was necessary to draw upon some of the special tests devised for use in
the earlier Hayes-Binet. Most of the tests in the final scale are oral, a few
requiring braille materials. A retest reliability of .90 and a split-half
reliability of .91 are reported by Hayes. A correlation of .83 was found between
this test and the earlier Hayes-Binet. Correlations with braille editions of
standard achievement tests ranged from .82 to .93. The validity of this test
was also checked against school progress.


² Originally designated as an interim edition because of the tentative nature
of its standardization, this revision has now come to be known by this name in
the literature.
The Wechsler scales, to be discussed in Chapter 12, have also been
adapted for blind subjects (cf. 12). These adaptations consist essentially in
using the verbal tests and omitting the performance tests. A few items
inappropriate for blind subjects have also been replaced by alternates. When
tested under these conditions, blind subjects as a group have been found to
equal or excel the general seeing norms (12). A number of the group
intelligence tests discussed in Chapter 9 have likewise been adapted for use
with the blind. Among them may be mentioned the Kuhlmann-Anderson, Otis,
Pintner Verbal Series, and the Scholastic Aptitude Test of the College
Entrance Examination Board.


A type of subject that has only recently begun to receive special attention
in psychological testing is the orthopedically handicapped (1, 14, 15, 27,
44, 46, 47, 79, 80). Although usually able to receive auditory and visual
stimulation, these individuals may have such severe motor handicaps as to
make either oral or written responses impracticable. The manipulation of
formboards or other performance materials would likewise meet with
difficulties. Working against a time limit or in strange surroundings often
increases the motor disturbance in the orthopedically handicapped. Their
greater susceptibility to fatigue makes short testing sessions necessary.

Some of the severest motor handicaps are found among the cerebral
palsied. Yet surveys of these cases have frequently employed common
intelligence tests such as the Stanford-Binet or the Arthur Performance Scale.
In such studies, the most severely handicapped were usually excluded as
untestable. Frequently, informal adjustments in testing procedure are made in
order to adapt the test to the child's response capacities (cf. 16). Both of
these procedures, of course, are makeshifts.

A more satisfactory approach lies in the development of testing instruments
suitable for even the most severely handicapped individuals. Research is now
in progress on the development of special tests and the adaptation of existing
tests for use with the cerebral palsied. A number of relevant techniques are
still in an experimental stage or have not been fully described in the
literature. Adaptations of the Leiter International Performance Scale and the
Porteus Mazes, suitable for administration to cerebral-palsied children, have
been prepared. In both, the examiner manipulates the test materials, while the
subject responds by appropriate head movements. A similar adaptation of the
Stanford-Binet has been worked out.

The previously discussed Progressive Matrices provide a promising tool for
this purpose. Since this test is given with no time limit and since the
response may be indicated orally, in writing, or by pointing or nodding, it
appears to be
especially appropriate for the orthopedically handicapped. Despite the
flexibility and simplicity of its response indicator, this test covers a wide
range of difficulty and provides a fairly high test ceiling. Successful use of
this test has been reported in studies of cerebral-palsied children and adults
(1, 44, 81).

Another test that permits the utilization of a simple pointing response is
the Full-Range Picture Vocabulary Test (2, 3). This test was designed as 
a rapid measure of “use” vocabulary, especially for persons unable to
vocalize well, such as the cerebral palsied. The test consists of 16 plates, each
containing four cartoon-like drawings. As the examiner speaks each word, 
the subject indicates by pointing or other signal which of the drawings best 
fits the given word. Two forms are available, each including 85 words of 
increasing difficulty. Norms are provided from a mental age of 2 to the 
superior adult level. These norms are based on a total of 589 cases, including 
15 boys and 15 girls at each year of age from 2 to 17, and some additional 
adult subjects. Although the individual normative samples are small, they 
were chosen so as to be representative of the general population in
occupational level and in age-grade placement. Preliminary data suggest that the
test has satisfactory reliability and that it correlates highly with Stanford- 
Binet vocabulary scores. 

A test that was originally developed for estimating the intellectual level of 
cerebral-palsied children is the Columbia Mental Maturity Scale (18). This 


scale comprises 100 items, each consisting of a set of three, four, or five 
drawings printed on a 6-by-19-inch card. The subject is required to identify 


the drawing that does not belong with the others, indicating his choice by 
pointing or nodding. To heighten interest and appeal, the cards and drawings 
are varicolored. Scores are expressed as mental ages and ratio IQ's. In
studies with non-handicapped subjects, these IQ's yielded a median correlation
of .77 with Stanford-Binet IQ's for single-age groups. Split-half reliability
of the tests is estimated at .94. This scale appears promising for testing
children with severe motor handicaps, although more data are needed to
evaluate its effectiveness.
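
The ratio IQ mentioned for this scale is conventionally found by dividing
mental age by chronological age and multiplying by 100. A minimal sketch
follows, assuming that ages are written in the years-months form used
elsewhere in this chapter (e.g., 8-8); the function names and the sample ages
are hypothetical.

```python
def age_in_months(age):
    """Convert an age written as 'years-months' (e.g. '8-8') to months."""
    years, months = age.split("-")
    return 12 * int(years) + int(months)


def ratio_iq(mental_age, chronological_age):
    """Ratio IQ: 100 times mental age divided by chronological age."""
    ma = age_in_months(mental_age)
    ca = age_in_months(chronological_age)
    return round(100 * ma / ca)


# Hypothetical child with a mental age of 10-0 at a chronological age of 8-0.
print(ratio_iq("10-0", "8-0"))  # 125
```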
CHAPTER 11


Infant and Preschool Tests


All tests designed for infants and preschool children require individual
administration. Some kindergarten children can be tested in small groups
with the types of tests constructed for the primary grades. In general,
however, group tests are not applicable until the child has reached school age.
Most tests for children below the age of 6 are either performance or oral
tests. A few involve rudimentary manipulation of paper and pencil.

It is customary to subdivide the first five years of life into the infant period
and the preschool period. The first extends from birth to the age of
approximately 18 months; the second, from 18 to 60 months. From the viewpoint of
test administration, it should be noted that the infant must be tested while
he is either lying down or supported on a person's lap. Little or no language is
possible during this period. Most of the tests deal with sensory and motor
development. The preschool child, on the other hand, can walk, sit at a table,
use his hands in manipulating test objects, and communicate by language. At
the preschool level, the child is also much more responsive to the examiner
as a person, while for the infant the examiner serves primarily as a means of
providing specific objects. Preschool testing is a more highly interpersonal
process, a feature that augments both the opportunities and the difficulties
presented by the test situation.

In the following sections, infant and preschool tests will be considered
separately. The distinction cannot be rigidly applied, however, since a
number of current scales overlap both periods. Moreover, it should be borne in
mind that all such stages are arbitrary, since behavior development is actually
gradual and continuous. Dividing lines are introduced only to facilitate
description.


INFANT TESTS 
One of the most extensive series of investigations of infant behavior is that
conducted at the Yale Clinic of Child Development under the direction of
Gesell. Gesell and his co-workers began a longitudinal study of behavior
development in the human infant by repeated observations of 107
infants constituting a relatively “normal” and homogeneous sample. Only
healthy children free from any known defects were included in the surveys.
The children were carefully selected so that the parents were of middle
socioeconomic status and close to the general average in amount of education
and in occupational level. All parents were American born and of
north-European extraction.

The infants in this group were examined at the ages of 4, 6, and 8 weeks,
and at every 4-week interval thereafter until the age of 56 weeks. Later
follow-ups were made at 18 months, and at 2, 3, 4, 5, and 6 years. [...]
examinations were continued over a 10 [...] were available. Only part of the
entire group was examined at any one age level, the numbers varying from 28
to 60 through age 5; at age 6, only 18 cases were included. Approximately
the [...] development the child has attained [...]


1. Motor behavior: covers both gross bodily control and finer motor
coordination. This category includes postural reactions, head balance, sitting,
standing, creeping, walking, [...] and manipulation of objects.


2. Adaptive behavior: covers eye-hand coordination in reaching [...]
objects, solution of practical problems, and exploration and manipulation of
objects. Examples include reactions to such stimuli as cubes, a ringing bell,
and a dangling ring, as well as drawing and simple formboards.


3. Language behavior: covers all means of communication, such as facial
expression, gesture, postural movements, prelinguistic vocalizations, and
speech. Comprehension of communication by others is also included.


4. Personal-social behavior: covers “the child's personal reactions to the
social culture in which he lives.” Among the types of behavior included in this
category are feeding, toilet-training and response to training in other
socially imposed situations, play, development of a “sense of property,”
smiling and other responses to persons, and responses to mirror.


In general, the Gesell Developmental Schedules represent a standardized
procedure for observing and evaluating the course of behavior development
in the child's daily life. Although a few may be properly described as tests,
most of the items in these schedules are purely observational. A set of standard
toys and other test objects, reproduced in Figure 55, is employed in
making these observations. Specifications regarding clinical crib,
infant-supporting chair, test table, observation play pen, and other equipment
required in the course of the behavior examinations are also provided (cf. 21,
pp. 448-455). The direct examination of the infant is supplemented by interview
data obtained from the mother.


Fig. 55. Test Objects Employed with the Gesell Developmental Schedules. (Courtesy 
The Psychological Corporation.) 


No single score is computed for the entire behavior schedule, the authors
arguing against such a composite measure. Instead, the record indicates the
approximate developmental level, in months, that the child has attained in
each of the four major areas covered. This is found by comparing the
child's behavior with that [...] eight "key ages," viz., 4, 16,
28, and 40 weeks; and 12, 18, 24, and 36 months (cf. 21, Ch. 3). In the
preparation of these developmental scales, behavior items were classified
into "increasing," "decreasing," and "focal" items. The first category includes
behavior whose frequency of occurrence increases with age; the second
covers behavior that decreases with age; and the third, behavior that increases
up to a certain age and then decreases. Increasing and decreasing
behavior items were allocated to those age levels at which the frequency of
occurrence was closest to 50 per cent; focal items were assigned to the age
levels at which they occurred with [...]


[Figure 56 panels: c. Transfers ring; d. Reaches with one hand]

Fig. 56. Drawings Employed with the Gesell Developmental Schedules to Illustrate
Typical Behavior at 28 Weeks of Age. (From Developmental Diagnosis, by Arnold
Gesell and Catherine S. Amatruda. Copyright 1941, 1947, by Arnold Gesell. By
permission of Paul B. Hoeber, Inc., Publisher.)


ings that accompany this description (21, pp. 44-46). [...] in the text
indicate behavior that characteristically appears for the first time at this key
age.


The 28-week-old infant sits with support, his trunk erect and the head steady. After
a brief period with an introductory toy, it is removed and the examiner presents
the FIRST of three CUBES. The baby seizes it immediately with a radial palmar
grasp and carries it to his mouth. He retains it as the SECOND CUBE is presented.
He does not grasp the second cube but he holds 2 cubes more than momentarily
when they are placed in his hands. As the THIRD CUBE is presented he drops a
cube. He does not grasp the third cube, but mouths, transfers, drops and resecures
the cube in hand.


He follows the screen as it is removed from the MASSED CUBES, then approaches
the mass with both hands, grasping one cube and scattering the others.
Holding one cube he grasps another; he may pick up 3 in all.

He follows the examiner's hand away as the PELLET is presented; gives delayed,
intent regard to the pellet, and rakes at it with his fingers, contacting it.

He makes an immediate one-handed approach on the BELL, taking it by the
bowl or junction. He bangs, mouths, and transfers the bell, retaining it without
dropping [Fig. 56a].

The RING AND STRING are presented, the string obliquely aligned to the 
right, but within reach. He reaches toward the ring, slaps and scratches the table, 
and finally sees the string; he either abandons the effort or fusses. 

The test table is removed. He is placed on his back on the platform. His
SUPINE posturings are symmetrical, with the legs lifted high in extension or
semi-extension. He lifts the head as though striving to sit up [Fig. 56b]. He is
none too tolerant of the supine position and this and the following three
situations may have to be curtailed or omitted.

He grasps, transfers and mouths the DANGLING RING, regarding it in hand
[Fig. 56c].

He makes an immediate one-handed approach upon the RATTLE [Fig. 56d].
Shakes it vigorously, regards it and fingers it with the free hand. If it is placed on
the platform at his side, he reaches for it unsuccessfully.

When auditory responses are tested by RINGING A BELL opposite first one
ear, then the other, he turns his head correctly and promptly.

The examiner now takes his hands and he lifts his head and assists in the
PULL-TO-SITTING. In the SITTING position he sits for a moment, leaning forward,
propped on his hands. He also shows some active balance, sitting erect for a
fleeting, unsteady moment.

Held in the STANDING position, he sustains a large fraction of weight on his
extended legs as he bounces actively.

Placed PRONE, he holds the head well lifted, his weight on his abdomen and
hands. He lifts one arm toward a lure and he tries, unsuccessfully, to pivot.

Seated before a MIRROR, he regards his image, smiles, vocalizes, and pats the
glass.

His LANGUAGE includes cooing, squealing, and combined vowel sounds. He
says m-m-mum when he cries.

His mother REPORTS that he discriminates strangers, "talks" to his toys, takes
solids well, and even brings his feet to his mouth. He rolls from supine to prone
and sits propped about half an hour.


Both observation and scoring procedures are less highly standardized in
the Gesell Schedules than in the usual psychological test. A certain amount
of subjectivity enters into the examination at several points. The restriction
in both size and nature of the normative sample should also be kept in
mind. No statistical analysis of reliability or validity is reported. In general,
[...] concerned with infant development. The [...] schedules were derived
represent an important [...] psychology. As testing instruments, however,
the [...] schedules are relatively crude.

Another type of measuring instrument suitable for the infant level is provided
by several special revisions [...] likewise covers most of the infant [...]

[...] psychologists consider one of the most satisfactory instruments for infant
testing is the Cattell Infant Intelligence Scale (12). This scale was developed as
a downward extension of the 1937 Stanford-Binet, Form L. In addition to
Stanford-Binet [...] material from the Gesell Developmental [...] infant tests,
together with some original items. The items are grouped into [...], and the MA
and ratio IQ are [...] as well as the [...] each level, permit more precise
measurement [...] than is possible with most other available infant tests.
Continuity and comparability with the Stanford-Binet are further advantages.
In order to insure close comparability of scores on the two scales, certain groups
within the standardization samples were retested at the age of 3 years with Form
L of the Stanford-Binet. The placement of items in the Cattell scale was adjusted
so as to yield approximately the same median IQ as that obtained by each group
on the Stanford-Binet.
The Cattell scale was standardized on a total group of 274 children, varying
numbers within this sample being retested at the ages of 3, 6, 9, 12,
18, 24, 30, and 36 months. As is nearly always true in longitudinal [...]
the entire group was not available for all retests. Nor did it prove possible
to administer all preliminary items suitable for a particular age to all children
of that age. The children came from lower-middle-class families, as judged
by income level and father's occupation, and were of north-European extraction.
Like other groups employed in longitudinal studies, this sample was
somewhat selected in terms of stability of residence and willingness of parents
to cooperate in the study.
The principal statistical criterion employed for item selection was increase
in percentage of children passing an item from one age to the next. Several
other practical criteria, however, were applied in retaining or discarding
items. Items were eliminated if they were difficult to administer or
score, involved an undue amount of subjectivity on the part of the examiner,
required cumbersome apparatus, or failed to hold the attention of young
children. An effort was also made to minimize the number of items testing
primarily muscular coordination or depending unduly upon specific home
training.
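The statistical criterion just described can be illustrated with a brief sketch. The item names and percentages below are hypothetical, and the rule of requiring a gain at every successive age is an assumption made for illustration; it is not offered as the procedure actually followed in constructing the Cattell scale.

```python
from typing import Dict, List


def passing_increases(pct_by_age: Dict[int, float]) -> List[float]:
    """Change in percentage passing between each pair of successive ages."""
    ages = sorted(pct_by_age)
    return [pct_by_age[b] - pct_by_age[a] for a, b in zip(ages, ages[1:])]


def shows_age_progress(pct_by_age: Dict[int, float], min_gain: float = 0.0) -> bool:
    """True if the percentage passing rises by more than min_gain at every step."""
    return all(gain > min_gain for gain in passing_increases(pct_by_age))


# Hypothetical data: percentage of children passing each item at 3, 6, and 9 months.
items = {
    "follows dangling ring": {3: 40.0, 6: 75.0, 9: 95.0},
    "bangs spoon": {3: 30.0, 6: 28.0, 9: 35.0},
}

# Items whose percentage passing increases steadily with age are kept.
retained = [name for name, pct in items.items() if shows_age_progress(pct)]
print(retained)
```

An item such as the second one, whose percentage passing fails to rise from one age to the next, would be a candidate for rejection on this criterion alone, quite apart from the practical criteria mentioned above.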


All items in the Cattell scale are administered without a time limit. The
author points out that timed tests are undesirable in testing infants and
preschool children. Not only do younger children [...] need for hurrying, but
a timed test may penalize [...] talkative and imaginative in his use of test
[...] as well as [...] deliberate child who plans carefully before he acts.
[...] most [...] Cattell scale requires no more than twenty to thirty minutes
[...] administration of the tests is not prescribed, but is modified to suit the
interests of the child and other specific circumstances. For example, tests that
the child takes in a prone position should usually precede those [...] is held
on the lap, since he is more likely to object to the prone position after he [...]
The materials required to administer these tests are very
similar to those employed with the Gesell Developmental Schedules and at
the lower levels of the Stanford-Binet. At the youngest ages, the tests are
largely perceptual, comprising such activities as [...] or bell,
following a dangling ring or a moving person with the eyes, looking at a
spoon or a cube, and inspecting own fingers. A few motor items, such as
lifting head, manipulating fingers, or transferring objects from hand to hand,
are also included. With increasing age, more complex manipulatory tasks are
introduced and increasing use is made of verbal functions. Blocks, pegboards,
formboards, cups, spoons, dolls, and other toy objects are employed
at these levels. At the higher ages, the child follows oral instructions in using
these materials. Naming objects or pictures of objects and identifying or
pointing to objects named by the examiner are among the more highly verbal
[...]

Other examples of infant scales include the California First-Year Mental
Scale, developed by Bayley (4, 5) for use in the Berkeley Growth Study; the
Northwestern Intelligence Tests prepared by Gilliland (3); and the Griffiths
Mental Development Scale more recently standardized in England (26,
27). Like most infant scales, these tests have borrowed items extensively
from each other and from the Gesell and Binet scales. In content, they thus
have much in common. They differ considerably, however, in adequacy of
standardization and in amount of information available on reliability and
validity. In their present form, some are crude and serve only to suggest
promising testing techniques. Others are sufficiently well developed for
immediate practical application. For specific information and evaluation, the
reader is urged to consult the Mental Measurements Yearbooks and other
relevant publications cited for each test.


EVALUATION OF INFANT TESTS 


The testing of infants presents many difficulties in administration and scoring,
and hence requires special procedures (cf., e.g., 12, Ch. 3). Oral directions
cannot be used for most tests. The examiner must rather set the stage
so that the desired response is elicited. Similarly, the infant is not motivated
to "do his best on the test." [...] with the examiner must be [...] adversely
[...] another problem. The [...] any competing stimuli, and it is consequently
difficult to hold or direct his attention. [...] by the examiner. Few [...] that
may be studied at leisure [...]

The normative samples employed in standardizing infant scales are smaller
and often less representative than those used in developing tests for older
children. On the other hand, a longitudinal approach has generally been followed
in the construction of infant scales, in contrast to the cross-sectional
approach characteristic of tests standardized on older children. An advantage 
of the longitudinal method stems from the greater uniformity of sampling it 
provides, since the same subjects are tested at successive ages. The cross- 
sectional method is subject to selective factors that may operate differenti- 
ally at different ages. For example, if school children are tested, those in the 
upper grades represent a relatively superior sampling, since the less able tend 
to drop out. 

The reliability of infant tests has generally proved to be lower than that
of tests for older children. Such a finding is not surprising, in the light of
some of the previously mentioned difficulties of infant testing. Some of the
more recently developed infant scales, however, have yielded more promising
results with reference to reliability. The reliability coefficients of the Cattell
scale, found at different age levels within the standardization sample, are
given in Table 16. With the exception of the 3-month level, at which the
reliability coefficient is only .56, all other reliabilities fall between .71 and .90.
For comparative purposes, it may be noted that in a group of 62 3-year-old
children drawn from the same sample, Cattell found a reliability of .87 for the
Stanford-Binet, Form L. The Infant Intelligence Scale thus compares favorably
with the Stanford-Binet.


TABLE 16. Split-Half Reliability of the Cattell Infant Intelligence Scale
(Adapted from Cattell, 12, p. 49)

Age in     Number      Reliability
Months     of Cases    Coefficient
   3          87           .56
   6         100           .88
   9          85           .86
  12         101           .89
  18         100           .90
  24          80           .85
  30          56           .71


Similar results were obtained by Bayley with the California First-Year
Mental Scale (5). For the ages of 1, 2, and 3 months, the reliabilities
were only .63, .51, and .74, respectively. Beyond 4 months, however, the
coefficients ranged from .75 to .95, with a median value of .86. It should be
noted that items for infant scales are usually selected so as to sample a wide
variety of functions. Such heterogeneous content is unlikely to yield comparable
halves for the computation of split-half reliability. If closely [...]
forms had been available, the obtained reliability coefficients would [...]
still higher.
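For readers unfamiliar with the computation, the following sketch shows one common way of obtaining a split-half coefficient: scores on the odd-numbered and even-numbered items are correlated, and the result is stepped up with the Spearman-Brown formula. The odd-even division and the use of the Spearman-Brown correction are assumptions adopted here for illustration; the text does not state how the halves of the infant scales were formed. The item data are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half: float) -> float:
    """Estimated reliability of the full test from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores: List[List[int]]) -> float:
    """Each row holds one child's item scores (1 = pass, 0 = fail)."""
    odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_half, even_half))


# Hypothetical records for four children on a four-item scale.
records = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
]
print(round(split_half_reliability(records), 3))
```

When the two halves sample different functions, as is likely with heterogeneous infant items, the half-test correlation is lowered and the resulting coefficient underestimates the reliability that comparable forms would show.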
The determination of validity for infant tests is hampered by a dearth of
suitable criteria. Independent estimates of the intelligence of infants are not
readily available. We do not have school grades, records of job achievement,
or officers' ratings on infants! To be sure, for extreme deviants, independent
evidence of ability level can sometimes be obtained. This is especially true of
feebleminded children falling into clinical types that have clearly recognizable
physical symptoms, such as mongolism. Such a criterion, for example,
was applied by Gilliland to a small sample employed in the validation of the
Northwestern Intelligence Tests. For less extreme deviants, however, and
especially for the superior deviant, few criterion data can be found.

As a result, the validation of infant tests has been based largely upon two
criteria, viz., age differentiation and prediction of subsequent status. The first
of these criteria is generally employed in the original selection of items,
although it is also used to check both item performance and total scores on the
final forms. In terms of this criterion, infant [...] validity. Clear-cut and
progressive age changes in performance are found, even over as short a time
as a month.


[...] the subsequent intelligence [...] different test, such as the Stanford-Binet.
In this respect, then, the correlation would represent a validity coefficient
against another, well-established test as a criterion. Such a procedure is similar
to the validation [...] abridged screening tests against individual [...]

The second point concerns the [...] connection, reference [...] reliability in
Chapter 5. It [...] retest reliability is [...] determined over fairly short
intervals, ranging from a few days to a few [...] high stability over such short
intervals [...] Stanford-Binet. Long-range prediction of behavior is [...]
since so many fortuitous circumstances may influence the individual's
development in the interim. To be sure, the older the individual, the easier in
general will the prediction be, since a larger proportion of his development will
have already occurred (cf. 1, pp. 231-238; 2).

Any long-term prediction involves more factors than are operative in the 
usual retest reliability coefficient. Certainly a two-year retest coefficient ob- 
tained with an infant scale should not be compared with a two-month retest 
coefficient on an adult test. It should be further noted, in this connection, that 
intellectual development progresses much more rapidly in infancy than it 
does among older children. Consequently, the nature of the items in infant 
tests may show considerable variation over relatively short intervals. Items 
suitable for the 3-month level may be quite unlike those suitable for the 6- 
month level. This condition further complicates the interpretation of retest 
correlations at the infant level, even when the interval is only three or four 
months. 

In the light of the two points discussed, it would seem that the retest 
coefficients generally reported for infant tests are not truly comparable with 
retest reliability as usually measured in other tests. Whether such coefficients 
are classed under "validity" or placed in a separate category is of little
consequence. They may properly be considered measures of validity, provided that
subsequent performance is specified as the criterion and the length of the in- 
terval is indicated. 

We may now examine some typical retest correlations. In Bayley's longitudinal
study with the California First-Year Mental Scale, correlations between
tests administered under the age of 1 year and retests at 18 months
were close to zero (5, 6). With subsequent retests, negative correlations as 
high as —.21 were obtained. From such findings, Bayley concluded that in- 
telligence test scores at age 3 or later can be predicted better on the basis of 
parental education than from tests administered to the child during his first 
year. 

The results for the first year are only slightly better with the Cattell Infant
Intelligence Scale. Table 17 shows the correlations between Cattell IQ's
obtained at the ages of 3, 6, 9, 12, 18, 24, and 30 months, respectively, with the
Stanford-Binet (Form L) IQ subsequently obtained by the same children at the
age of 3 years. It will be noted that below the age of 12 months the predictive
correlations are only .10, .34, and .18. These correlations are little better
than chance and indicate that such tests had virtually no predictive value in
terms of the 3-year Stanford-Binet IQ. This lack of predictive validity of
Cattell IQ's obtained prior to the age of 1 year has been confirmed by other
investigators (13). Beginning with the age of 12 months, however, the
correlations are considerably higher, rising steadily from .56 to .83. It might be
noted for comparative purposes that the Stanford-Binet Form L IQ's obtained
by the same children at ages 3 and 3½ correlated .75.


TABLE 17. Validity Coefficients of the Cattell Infant Intelligence Scale
(Adapted from Cattell, 12, p. 49)

                          Correlation with
Age in     Number of      Stanford-Binet IQ
Months     Cases          at Age of 3 Years
   3          42               .10
   6          49               .34
   9          44               .18
  12          57               .56
  18          52               .67
  24          52               .71
  30          42               .83

In the California First-Year Mental Scale, a relatively large proportion of 
the tests administered below the age of 1 year are sensorimotor in nature. 
The Cattell scale contains fewer tests of this type, a special effort having been 
made to exclude them. It should be noted that, when sensorimotor tests are 
given to older children and adults, they show little or no correlation with
intellectual functions. Nor do such tests correlate highly with each other. A
fairly high degree of specificity has generally been found among sensorimotor
functions. It is therefore to be expected that those infant scales that contain a
large proportion of sensorimotor items will show little or no correlation with 
subsequently administered intelligence tests. In so far as it is possible to con- 
struct infant tests with items more closely resembling those occurring in tradi- 
tional intelligence tests, better prediction of future intellectual status seems 
likely. 

A promising approach to the testing of infant intelligence involves greater
utilization of the verbal factor, which is so closely identified with later
intelligence (3, 32). Items based on the use of language and the understanding of
words tend to have better predictive validity than the more usual sensorimotor
items. In very young infants, prelinguistic vocalization appears to bear
a significant relation to subsequent IQ. Systematic analyses of infant speech
have revealed consistent developmental trends in such characteristics as number
of different consonants uttered and ratio of consonants to vowels (10).
Even within the first six months of life, such speech indices exhibit differences
among socioeconomic levels which parallel group differences in intelligence
observed at later ages (10). Infants reared in their own homes, moreover,
excel institutional infants in such speech characteristics (18, 29), a finding that
is likewise in line with intelligence test results on older children. An exploratory
follow-up study of 23 cases yielded significant correlations as high as .45
between certain speech indices at the ages of 6 to 18 months and Stanford-Binet
IQ's at the ages of 3 to 4½ years (11). Prelinguistic vocalization thus
appears to be one area worth further investigation in constructing infant tests
of the future.


PRESCHOOL TESTS 


Among tests specially designed for the preschool level, the best known are 
the Merrill-Palmer Scale and the Minnesota Preschool Scale. The former was 
developed by Stutsman (36) at the Merrill-Palmer School in Detroit, Michi- 
gan. The tests for this scale were originally assembled from many sources, 
including performance scales, children's games, verbal tests such as the Wood- 
worth-Wells Association Tests, and others. From a total of 78 tests tried out 
on groups of nursery school children at the Merrill-Palmer School, 38 were
finally retained. The bases of test selection were popularity with children,
practicality in administration, relation with age, and differentiation of children
judged by nursery school staff to be bright or dull.

The normative sample to which the final 38 tests were administered consisted
of 631 children, 300 boys and 331 girls, between the ages of 18 and 77
months. The subjects, all of whom were in Detroit, Michigan, were obtained
from 20 different sources, including public and private schools, Merrill-Palmer
waiting list, orphanages, day nurseries, child-care agencies, and health
clinics. These children were classified into 6-month age groups, each group
containing from 49 to 81 cases.


The 38 tests in the scale yield a total of 93 scorable test elements, since
several tests are scored at different levels, depending upon the child's
performance. For each of these test elements, the "age at par" was computed;
that is, the age at which 50 per cent of the normative sample passed it. The
test elements were then arranged in ascending order of difficulty, according to
their age at par. The test elements, or items, were also grouped into 6-month
age levels, on the same basis. For example, items whose age at par ranged
from 21.0 to 23.4 were placed into the 18-to-23-months level. The scale is
suitable for testing children between the ages of 24 and 63 months.

The Merrill-Palmer Scale is administered as an age scale but scored as a
point scale. Testing is usually begun at a level within which the child's
chronological age falls. It is continued downward to a level at which all tests are
passed and upward to a level at which one-half or more of the tests are failed.
Although the tests are arranged in order of difficulty within each level, the
order of administration is flexible. It is customary to begin with a particularly
appealing test and to let the child's interests influence the course of the
testing. The tests are presented in individual, brightly colored, and differently
shaped boxes, a practice which stimulates curiosity and greatly enhances the
appeal of these tests for young children.
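The notion of "age at par" introduced above lends itself to a short numerical illustration. The sketch assumes that the age at which 50 per cent pass is located by linear interpolation between the two adjacent age groups that bracket 50 per cent; this interpolation rule, the function names, and the percentages are all hypothetical and are not taken from the Merrill-Palmer manual.

```python
from typing import Dict, Optional, Tuple


def age_at_par(pct_by_age: Dict[float, float], par: float = 50.0) -> Optional[float]:
    """Age (in months) at which `par` per cent of the sample pass the element.

    Linear interpolation between adjacent age groups is assumed.
    Returns None if the percentages never cross the par value.
    """
    ages = sorted(pct_by_age)
    for lo, hi in zip(ages, ages[1:]):
        p_lo, p_hi = pct_by_age[lo], pct_by_age[hi]
        if p_lo <= par <= p_hi and p_hi > p_lo:
            return lo + (par - p_lo) * (hi - lo) / (p_hi - p_lo)
    return None


def six_month_level(age_months: float, first_level_start: int = 18) -> Tuple[int, int]:
    """Six-month age level (first month, last month) containing the given age."""
    steps = int((age_months - first_level_start) // 6)
    start = first_level_start + 6 * steps
    return (start, start + 5)


# Hypothetical percentages passing one test element at three ages (in months).
element = {24: 20.0, 30: 60.0, 36: 90.0}
par_age = age_at_par(element)
print(par_age, six_month_level(par_age))
```

Arranging the elements in ascending order of their ages at par then yields the order of difficulty described in the text, and an element with an age at par of 21.0 months falls, as in the example given there, into the 18-to-23-months level.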

The score is determined by crediting one point for each test element 
passed, including all tests below the basal age. In view of the difficulties fre- 
quently encountered in testing preschool children, an adjustment is made for 
refusals and omissions. Any test element that could not be administered is 
credited as passed if it falls below the raw score finally attained, and as failed 
if it falls above that score. Normative data are provided for translating raw 
point scores into mental ages, percentiles, and standard scores. The computation
of IQ's with the mental ages found from this scale is not advocated, since
the SD of the mental ages does not increase proportionately with age. In fact, 
it increases up to a certain age, and then decreases. 
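The adjustment for refusals and omissions may be made concrete by a small sketch. It rests on several assumptions that should be kept in mind: the test elements are numbered 1, 2, 3, and so on in ascending order of difficulty; an omitted element is credited when its number does not exceed the score finally attained; and, since each credited omission itself raises the score, the computation is repeated until the score no longer changes. The function name and the sample record are hypothetical.

```python
from typing import List


def adjusted_score(results: List[str]) -> int:
    """Point score with omitted elements credited or failed by position.

    `results` lists the outcome for each element in order of difficulty,
    each entry being "pass", "fail", or "omit" (refused or not given).
    """
    passed = sum(1 for r in results if r == "pass")
    omitted_ranks = [i for i, r in enumerate(results, start=1) if r == "omit"]

    score = passed
    while True:
        # Credit every omitted element that lies at or below the current score.
        credited = sum(1 for rank in omitted_ranks if rank <= score)
        new_score = passed + credited
        if new_score == score:  # no further change: final score reached
            return score
        score = new_score


# Hypothetical record: elements 3 and 6 could not be administered.
record = ["pass", "pass", "omit", "pass", "fail", "omit", "fail"]
print(adjusted_score(record))
```

In this record the third element is credited, since it lies below the score attained, while the sixth lies above that score and is counted as failed.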

The Merrill-Palmer Scale includes relatively few language tests, most of 
the tests being of the performance type. The language tests consist of memory 
span for words and for meaningful word groups; answering simple questions 
(e.g., "What does a doggie say?" "What is this?" [pencil] "What is it for?");
and action-agent association (e.g., "What runs?" "What cries?"). A number
of tests measure predominantly sensorimotor coordination, as illustrated by
throwing a ball, pulling a string, crossing feet, standing on one foot, closing
fist and moving thumb, cutting with scissors, buttoning, folding paper, building
with blocks, and fitting cubes into a box. Some involve matching colors or
pictures, reconstruction of cut-out picture puzzles, or copying a drawing of a
circle, cross, or star. A few familiar performance tests are also included, such
as the Seguin Form Board, Mare and Foal, and Manikin. Speed plays an important
part in many of these tests, the level at which the test is passed often
depending upon the time required to complete it. The materials used in
administering the Merrill-Palmer tests are pictured in Figure 57.

The validity of the Merrill-Palmer Scale was determined in part by the
method of selecting tests in terms of age differentiation and relation to ability
as judged by nursery school staff. Within the standardization sample of 631
cases, a correlation of .92 was found between chronological age and total score
on the Merrill-Palmer. This step may be regarded as a cross-validation of the
initial item selection in terms of the criterion of age differentiation. The degree
of overlap of scores between adjacent age groups was also found to be
slight in the standardization sample. Validity was further checked by comparing
the performance of 29 mentally defective children, aged 4½ to 12½, with
the norms. Finally, correlations in the high .70's are reported between
Merrill-Palmer and Stanford-Binet mental ages in one mentally defective and
two normal samples. These correlations are misleading, however, since
chronological age varied widely within these groups. Subsequent studies with more
homogeneous samples have yielded lower correlations with the Stanford-Binet
and the Kuhlmann-Binet (cf. 41).
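The reason such correlations are misleading can be shown with a deliberately extreme, hypothetical example. Within each of two age groups below, mental ages on the two tests are entirely unrelated; yet when the groups are pooled, the wide difference in chronological age alone produces a very high correlation. The figures are invented for illustration and do not come from any of the studies cited.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical mental ages (in months) on two tests for two age groups.
younger_a, younger_b = [20, 22, 20, 22], [22, 20, 20, 22]
older_a, older_b = [60, 62, 60, 62], [62, 60, 60, 62]

r_younger = pearson_r(younger_a, younger_b)  # no relation within the group
r_older = pearson_r(older_a, older_b)        # no relation within the group
r_pooled = pearson_r(younger_a + older_a, younger_b + older_b)

print(r_younger, r_older, round(r_pooled, 3))
```

The pooled coefficient reflects little more than the fact that older children obtain higher mental ages on both tests, which is why samples homogeneous in age give a fairer picture of the agreement between two scales.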


Fig. 57. Materials for Use in Administering the Merrill-Palmer Scale. Each test is
presented in an individual, gaily colored box. (Courtesy C. H. Stoelting Company.)


No information regarding reliability is given in the test manual, but later
studies (cf. 41) report retest reliabilities ranging from 12 to .96 with intervals
of two months or less. These coefficients were obtained on small groups
of children aged approximately 2 to 5 years and varying widely in ability.
Such reliabilities, too, would undoubtedly be lower in more homogeneous
samples.

The major weaknesses of the Merrill-Palmer Scale center about its emphasis
upon motor skills and upon speed. For the preschool child, speed has
not yet become an important goal. The procedures for evaluating scores are
also crude, when judged in terms of current standards of test construction.
The principal asset of this scale is its undoubted appeal to young children,
both because of the intrinsic interest of the tasks and because of the manner
of presenting tests in attractive, gaily colored boxes. Another advantage,
from the standpoint of preschool testing, stems from the scoring adjustments
available for handling refusals and omissions.

It should also be noted that a revision of the Merrill-Palmer Scale is in 
progress. It is anticipated that this revision will extend the scale upwards to 
about the 12-year level and downwards to include a new infant scale pre- 
pared by Nancy Bayley. An attempt will be made to measure the same 
abilities, as identified through factor analysis, at all age levels. 

The Minnesota Preschool Scale (24, 25) is available in two equivalent 
forms, A and B, each consisting of 26 tests. These tests were derived largely 
from the Kuhlmann-Binet; some were taken from other scales and some are
original tests. The standardization sample consisted of a carefully selected
group of 900 children, including 100 at each half-year from 1½ to 6 years
of age. They were obtained from nursery schools, public and private schools,
clinics, and settlement houses, and were selected in terms of father's occupation
so as to constitute a representative sampling of the population of Minneapolis.


The Minnesota scale contains relatively few motor items. None of the tests 
is timed. The tests are classified as verbal and non-verbal, each type yielding
a separate score. Among the verbal tests are to be found: pointing to parts
of the body on a large doll; pointing to objects in pictures; naming objects;
telling what a picture is about; following directions; comprehension, or
answering practical questions such as "What should you do when you are
hungry?"; naming objects from memory; naming colors; identifying incomplete
pictures; digit span; detecting absurdities; vocabulary; opposites; and number
of words in longest sentence used spontaneously by the child during the
examination. The similarity of many of these tests to Stanford-Binet items is
apparent.

The non-verbal tests include: copying drawings of circle, triangle, and dia- 
mond; imitative drawing of horizontal stroke and vertical cross; block-build- 
ing; Knox cube; form discrimination; form recognition; tracing forms; rear- 
rangement of cut-out picture puzzles; paper-folding; indicating missing parts 
in pictures; and imitating position of hands of clock. The materials employed 
in administering the tests are shown in Figure 58. Many of the pictorial
materials are bound into an examination booklet, which may be seen in the
illustration.

Norms are provided for each month of age between 1½ and 6 years.
Scores on verbal, non-verbal, and total scale can be expressed as IQ
equivalents or in terms of specially developed equal-unit scales indicating the
difficulty level of tasks the child can pass as well as his position in a
normal distribution. Parallel-form reliability, with intervals of one to seven
days, ranged from .68 to .94 for the verbal scales, from .67 to .92 for the
non-verbal, and from .80 to .94 for the total scale. These coefficients were
found in the standardization sample by correlating Forms A and B within single
half-year age groups.

Fig. 58. Materials Employed in the Minnesota Preschool Scale. (Reproduced by
permission of Educational Test Bureau, Minneapolis.)

1 Personal communication from Dr. Rachel Stutsman Ball, Arizona State
University, Tempe, Arizona.


Validity of the Minnesota Preschool Scale was initially determined through
the criteria of age differentiation and internal consistency, both of which were
employed in the selection and placement of tests. Relationship of test scores
to father's occupational level was taken as further corroborative evidence of
validity. In two extensive follow-up studies (24, 33), the predictive validity
of the Minnesota Preschool Scale was checked against intelligence test
performance in later childhood and adulthood. The principal criteria used in
these follow-ups were scores on the Stanford-Binet and the Army Alpha. The
correlations varied widely, but in general indicated some predictive value.
For example, 1937 Stanford-Binet IQ's obtained between the ages of 4½
and 13½ years showed a median correlation of .68 with Minnesota Preschool
IQ's obtained at 4 years of age and over. With Minnesota IQ's obtained
between 3 and 4 years, the median correlation was .61; and with Minnesota
IQ's obtained before the age of 3 years, it was .21.

In the adult follow-up (33), 226 subjects who had been tested with the 
Minnesota Preschool Scale before the age of 6 years were located and given 
the Army Alpha at ages ranging from 16½ to 22 years. The principal
objective of this follow-up was the selection of the most discriminative items in
terms of the criterion of adult Alpha score. Within a subsample of 46 cases, 
however, correlations were computed between adult Alpha score and initial 
scores on the Minnesota Preschool Scales A and B (33, p. 74). These cor- 
relations were found to be .31 and .33, respectively. It is interesting to note 
that approximately the same correlations were obtained in this subsample be- 
tween Alpha scores and Minnesota Preschool scores based only on the items 
selected in terms of the follow-up criterion. Such a finding suggests that, if 
more items of the same degree of predictive efficiency were added, higher 
correlations might be obtained. 
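
The closing suggestion can be made concrete with the standard formula for the validity of a lengthened test. The sketch below is only an illustration under stated assumptions: it takes the reported correlation of .31 as the starting validity, uses a purely hypothetical reliability of .50 for the short set of selected items (no such figure is given in the text), and assumes the added items are comparable to those already selected.

```python
import math


def lengthened_validity(r_xy: float, r_xx: float, n: float) -> float:
    """Predicted validity of a test lengthened n-fold with comparable items.

    r_xy: validity of the original test against the criterion
    r_xx: reliability of the original test
    n:    factor by which the test length is multiplied
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return r_xy * math.sqrt(n / (1.0 + (n - 1.0) * r_xx))


observed_validity = 0.31    # Form A vs. adult Alpha, as reported in the text
assumed_reliability = 0.50  # hypothetical value, chosen only for illustration

for factor in (1, 2, 4):
    predicted = lengthened_validity(observed_validity, assumed_reliability, factor)
    print(f"length x{factor}: predicted validity = {predicted:.3f}")
```

Under these assumptions the predicted validity rises with test length but levels off, since lengthening removes only the attenuation due to unreliability of the predictor.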

All in all, the outstanding features of the Minnesota Preschool Scale appear
to be the careful procedures followed in standardization and in development
of norms, the availability of parallel forms, the high parallel-form reliability,
the elimination of speed, the reduction in number of motor items, and the
extensive follow-up data. On the negative side, the tests in this scale have not
proved to be as appealing to young children, especially under the age of 3,
as other preschool tests. This scale is probably most useful in testing normal
children between the ages of 3 and 5 years. The lack of any provision for
handling refusals and omissions is also a handicap in testing preschool
children. Similarly, the practice of beginning with the easiest items in the
scale, regardless of the child's age, makes the scale unduly long and boring
for the older and brighter subjects.

Mention may also be made of an intelligence test developed in England by
Valentine (38), suitable for children between the ages of 1½ and 15.
Although crudely standardized, this scale combines a relatively large number
of tests from earlier sources (Binet, Gesell, Merrill-Palmer, etc.) and
includes a few ingenious new tests. No data on reliability or validity are
reported by the test author. In an investigation by Wakelam (39) on
kindergarten and primary grade children, however, promising results were
obtained regarding retest reliability, as well as concurrent and predictive
validity against several academic criteria.

Attention should likewise be called to a number of well-standardized and
widely used tests suitable for the preschool ages but extending either
downward into the infant level or upward into the school period. These tests
have already been discussed in connection with other age levels. Among them
are the Kuhlmann-Binet and the Cattell Infant Intelligence Scale, which
bridge the gap completely from infancy to school age. When the Cattell scale
is considered together with the Stanford-Binet, for which it was designed as a
downward extension, the entire preschool period is adequately covered.
Because of the wide use and careful standardization of the Stanford-Binet and
its continuity to the adult level, there is much to recommend the combination
of Cattell and Stanford-Binet as a means of testing preschool children.

Among the performance scales discussed in Chapter 10, the Leiter
International Performance Scale extends down to Year Level 2 and is considered
suitable for testing normal children at age 4 and over. The point scale
adaptation of the lower levels of this scale was prepared by Arthur as a
downward extension of the Arthur Performance Scale.

The Gesell Developmental Schedules include a Preschool as well as an 
Infant Schedule. The Preschool Schedule (19) extends from the age of 15 
months to 6 years. The general approach is similar to that followed in the
Infant Schedule. The items are largely observational and are classified under
the same four types of development, viz., motor, adaptive, language, and
personal-social. Descriptive norms are provided for every three-month
interval between 15 months and 6 years. The normative data were obtained
from further follow-ups of the infants observed in the standardization of the
Infant Schedule, as well as from data on several hundred children
subsequently examined in the Yale Clinic of Child Development.

Two other types of developmental scales may be considered in this
connection, namely, the Oseretsky Tests of Motor Proficiency and the Vineland
Social Maturity Scale. Both of these scales extend well beyond the preschool
level, through later childhood and adolescence. They are of special relevance
to the present discussion, however, because of certain similarities to the
Gesell scales in content and in general approach. They are also more suitable
for use at the lower age or intellectual levels than at higher levels.

The Oseretsky Tests of Motor Proficiency were originally published in
Russia in 1923. They were subsequently translated into several languages
and used in a number of European countries. In 1946, Doll (14), then
Director of Research for the Vineland Training School, sponsored and edited
an English translation of the Portuguese adaptation of these tests. A scale of
motor development is especially useful in testing mental defectives, who are
also frequently retarded in motor functions. Other applications of the
Oseretsky tests are found in the testing of children with motor disorders, in
connection with the administration of therapeutic and training programs. The
age range covered by the original Oseretsky tests extended from 4 to 16 years,
the tests being arranged into year levels as in the Stanford-Binet. A "motor
age" was likewise computed by a procedure similar to that followed in the
Stanford-Binet. The Oseretsky scale was designed to cover all major types
of motor behavior, from postural reactions and gross bodily movements to
finger coordination and control of facial muscles. Administration of these
tests requires only simple and easily obtainable materials, such as
matchsticks, wooden spools, thread, paper, rope, boxes, rubber ball, and the
like. Directions are given orally and by demonstration.

In their original form, some of the Oseretsky tests involved the
comprehension and recall of fairly complex instructions. Such "intellectual
loading" of the tests would tend to produce spuriously high correlations with
intelligence tests and to make interpretation of scores ambiguous. Instructions
and scoring procedures for some of the tests were unclear and inadequately
standardized. In 1955, the Lincoln-Oseretsky Motor Development Scale (35) was
issued as a revision and restandardization of the Oseretsky tests. Covering
only ages 6 to 14, this revision includes 36 of the original 85 items. Further
standardization to cover younger ages is anticipated. The tests, which in this
revision are arranged in order of difficulty, were chosen on the basis of age
correlation, reliability, and certain practical considerations. Tentative
percentile norms were found on a standardization sample of 380 boys and 36[…]
girls attending public schools in central Illinois. Split-half reliabilities
computed for single age and sex groups fell mostly in the .80's and .90's. A
one-year retest yielded a correlation of .70. A factor analysis of a slightly
longer earlier version indicated a single common factor identified as motor
development.

The Vineland Social Maturity Scale (15, 16) is a developmental schedule
concerned with the individual's ability to look after his practical needs and
to take responsibility. Although covering a range from birth to over 25 years,
this scale has been found most useful at the younger ages, as well as with
mental defectives. The entire scale consists of 117 items grouped into year
levels. The information required for each item is obtained, not through test
situations, but through an interview with an informant or with the subject
himself. The scale is based on what the subject has actually done in his daily
living. The items fall into eight categories: general self-help, self-help in
eating, self-help in dressing, self-direction, occupation, communication,
locomotion, and socialization. A social age (SA) and a social quotient (SQ)
can be computed from the subject's record on the entire scale.

The Vineland Scale was tentatively standardized on 620 subjects, including
10 males and 10 females at each year from birth to 30 years. Validity of
this scale was determined chiefly on the basis of age differentiation,
comparison of normals and mental defectives, and correlation of scores with
judgments of observers who knew the subjects well. A retest reliability of .92
has been reported for 123 cases, the retest intervals varying from one day to 9
months. The use of different examiners or informants did not appreciably
affect results in this group, as long as all informants had had an adequate
opportunity to observe the subjects. An adaptation for blind preschool children
has been developed and standardized by Maxfield and Buchholz (34).

Correlations between the Vineland Scale and the Stanford-Binet vary
widely but are sufficiently low, in general, to indicate that different facets of
behavior are being tapped by the two scales. The Vineland Social Maturity
Scale has proved helpful to clinicians in diagnosing feeblemindedness and in
reaching decisions regarding institutionalization. For example, an individual
who is intellectually deficient in terms of the Stanford-Binet may be able to
adjust satisfactorily outside of an institution if his social age on the Vineland
Scale is adequate. Discrepancies between MA and SA may likewise
contribute to an understanding of certain cases manifesting behavior problems or
delinquency.

On the other hand, the available norms must be regarded as tentative.
The standardization sample obviously included too few cases at each age to
insure desirable stability and representativeness of norms. Moreover, the
subjects came chiefly from middle-class American homes. It is apparent that
many items, such as those relating to going out alone, use of spending money,
and the like, would have a different significance for children in different
socioeconomic levels or in certain minority groups. Cultural differences in
child-rearing customs, rather than the child's ability level, might in such cases
account for deviations from the norms. Similarly, a considerable number of
items are unsuitable for institutional children.
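
The social quotient and the MA-SA comparison discussed above can be illustrated with a short sketch. It assumes that the SQ is formed like a ratio IQ, as 100 times social age divided by chronological age; the manual's exact computing rules are not given here, and the case values are hypothetical.

```python
def social_quotient(social_age: float, chronological_age: float) -> float:
    """Ratio-type social quotient: 100 * SA / CA (assumed form)."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * social_age / chronological_age


def ma_sa_discrepancy(mental_age: float, social_age: float) -> float:
    """Positive when social age exceeds mental age, in years."""
    return social_age - mental_age


# Hypothetical case: an 8-year-old with SA of 6 years and MA of 5 years.
sq = social_quotient(social_age=6.0, chronological_age=8.0)
gap = ma_sa_discrepancy(mental_age=5.0, social_age=6.0)
print(f"SQ = {sq:.0f}, SA exceeds MA by {gap:.1f} years")
```

In this invented case the social age runs a year ahead of the mental age, the kind of discrepancy the text describes as relevant to decisions about institutionalization.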


EVALUATION OF PRESCHOOL TESTS 


Many of the problems encountered in the administration of infant tests are
also met at the preschool level. Although oral directions can now be relied
upon to a much greater extent, the problems of motivation and interest, short
attention span, and susceptibility to fatigue remain. In scoring at least some
preschool tests, moreover, subjectivity is likely to enter, since many test
responses at this level leave no permanent record. Apart from these difficulties,
which are shared with infant tests, preschool testing presents certain
problems peculiar to its own level (cf. 19, Ch. 11; 25, pp. 11-18; 40, pp.
340-344). As the child's sphere of activity widens, and as he reacts
increasingly to the interpersonal aspects of the test situation, new problems
emerge.

The principal characteristics that may interfere with satisfactory test
performance at the preschool age are shyness, distractability, and negativism
(cf. 25). A shy child may be frightened by strange surroundings or a strange
examiner. He may cry, cling to his mother, refuse to try the tests, and object
to remaining in the examination room. Especially with younger preschool
children, permitting the mother to remain in the examination room or even to
hold the child on her lap may reassure a shy child. A further problem arises
from the fact that some preschool children are highly distractable and
hyperactive. They cannot remain seated and are constantly running about the
room, handling materials, performing acrobatics, and inquiring about a
variety of unrelated and irrelevant matters. The very talkative child of this age
may try to take over the examination himself.

The negativism often exhibited by children of this age level has been ex- 
tensively discussed in the literature of child psychology. In the test situation 
this behavior may take the form of flat refusals to perform, complete silence 
and general unresponsiveness, failure to follow directions in the use of test 
materials, or screaming and temper tantrums. In all but the most resistant 
cases, the examiner may elicit the cooperation of such a negativistic child by 
a number of subterfuges and indirect appeals to his interests, or by a brief
delay (17, 37). Sometimes the testing session must be postponed because of
extreme manifestations of any of the three types of behavior described.

With regard to standardization procedures, preschool tests have more in
common with school-age tests than with infant tests. Norms have usually been
obtained by a cross-sectional rather than a longitudinal approach. Two
important exceptions, however, are the Gesell and the Cattell scales, which
obtained longitudinal norms for the preschool as well as for the infant level.
Normative samples are, in general, larger than those used in the infant tests
and are more often chosen so as to be representative of a well-defined
population. Parallel forms appear for the first time. Contemporary criteria are
more often employed for validation purposes, although the interest in
prediction of subsequent status persists.

Some of the follow-up studies conducted with reference to preschool tests
have fairly broad implications, which extend beyond the specific tests
employed. In the investigation by Bradway (7) cited in Chapter 8, 138
subjects who had been tested between the ages of 2 and 5½ years during the
standardization of the Stanford-Binet Scale were retested with Form L 10
years later, when they were from 12 to 15½ years old. The data were
analyzed separately for those children whose age at initial test was between 2
and 3½ (N = 52) and those whose age was between 4 and 5½ years (N = 86).
In the younger group, correlations between initial IQ and 10-year retest
were .58 and .67 for initial Forms L and M, respectively. In the older group,
the corresponding correlations were .67 and .63. An analysis of individual
Stanford-Binet items showed that verbal and memory items had higher
predictive validity against the 10-year criterion than did non-verbal items (8).
The correlation of preschool IQ's with a 25-year retest, at a mean age of 29
years, was .59 (9).
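
What correlations of this size imply for the individual case can be gauged from the standard error of estimate. The following sketch is illustrative only: it assumes an IQ standard deviation of 16 points for the retest (a conventional figure that is not stated in this passage) and applies the usual formula to the correlations reported above.

```python
import math


def standard_error_of_estimate(sd_criterion: float, r: float) -> float:
    """Standard error of estimate: SD_y * sqrt(1 - r**2)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return sd_criterion * math.sqrt(1.0 - r * r)


ASSUMED_IQ_SD = 16.0  # assumption for illustration, not taken from the text

# Correlations reported for the 10-year and 25-year retests.
for label, r in (("younger group, Form L", 0.58),
                 ("younger group, Form M", 0.67),
                 ("25-year retest", 0.59)):
    see = standard_error_of_estimate(ASSUMED_IQ_SD, r)
    print(f"{label}: r = {r:.2f}, standard error of estimate = {see:.1f} IQ points")
```

With correlations in this range the standard error of estimate remains a large fraction of the assumed standard deviation, which is why a single early IQ gives only a rough forecast for any one child.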

Although these correlations are too low to permit accurate predictions in
individual cases, they show a substantial stability of test performance over
periods of 10 and even 25 years. When the same, well-constructed instrument
is employed in initial test and retest, the predictive value of IQ's obtained by
2- or 3-year-old children appears to be much higher than was previously
supposed. It should also be borne in mind that these correlations are based
on a single initial test for each individual. Gesell and others have repeatedly
pointed out that better prediction of subsequent status can be made on the
basis of repeated preschool tests. Such retests provide a more reliable
estimate and also permit the observation of developmental trends.

The previously cited follow-up study by Maurer (33) with the Minnesota
Preschool Scale represents an effort to increase the predictive value of
preschool tests by item selection. Biserial correlations were found between each
item on the Minnesota Preschool Scale and Alpha scores obtained by the
same subjects as adults. Despite the crudeness of the criterion measure and
the small number of cases available for the computation of individual
item-criterion correlations, certain suggestive findings emerged from this study.

A comparison of the items retained with those discarded on the basis of
predictive value reveals differences that also appear reasonable on other
grounds. Thus the least-predictive items included several that were unduly
influenced by motor coordination. It will be recalled that in infant tests, too,
such items were found to be least satisfactory for measuring intellectual
development. Similarly, tests lacking sufficient interest for young children, as
well as those employing confusing or complicated directions, often fell into
the non-predictive category. Items of this sort would usually be judged faulty
for any testing purpose. Another group of items with low predictive value
depended largely upon rote memory for information that the child might or
might not have acquired during his prior experience. Examples of these items
include naming objects or parts of pictures, and pointing to body parts or to
objects named by the examiner. In such tasks, too many fortuitous
circumstances may have influenced the child's knowledge of the names.

Among the most predictive tests in Maurer's investigation were those
emphasizing perception of spatial relations, controlled attention, memory, and
logical relations. Specific examples include incomplete pictures, mutilated
pictures, block-building, discrimination of colors and forms, cut-out picture
puzzles, Knox cube, definitions, word opposites, detection of verbal absurdities,
and vocabulary. Once more, verbal tests and those involving abstract
concepts and relations proved to be effective predictors of later intelligence.
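
The item-criterion statistic used in Maurer's item selection, the biserial correlation, can be sketched as follows. This is a minimal illustration of the conventional formula, not a reconstruction of Maurer's computations; the pass-fail records and adult criterion scores are invented for the example.

```python
from statistics import NormalDist, mean, pstdev


def biserial_correlation(passed, criterion):
    """Biserial r between a pass-fail item and a continuous criterion.

    Uses r_b = ((M_pass - M_fail) / SD_total) * (p * q / y), where y is the
    ordinate of the unit normal curve at the point dividing p from q.
    """
    pass_scores = [c for ok, c in zip(passed, criterion) if ok]
    fail_scores = [c for ok, c in zip(passed, criterion) if not ok]
    if not pass_scores or not fail_scores:
        raise ValueError("item must be passed by some subjects and failed by others")
    p = len(pass_scores) / len(criterion)
    q = 1.0 - p
    unit_normal = NormalDist()
    y = unit_normal.pdf(unit_normal.inv_cdf(p))
    return (mean(pass_scores) - mean(fail_scores)) / pstdev(criterion) * (p * q / y)


# Hypothetical data: preschool item results and adult criterion scores.
item_passed = [True, True, True, True, False, False, False, False]
adult_score = [120, 95, 110, 100, 105, 90, 85, 98]
print(f"biserial r = {biserial_correlation(item_passed, adult_score):.2f}")
```

Because the formula assumes a normally distributed trait underlying the pass-fail split, values computed on very small samples are unstable and can even exceed 1.00, which bears on the text's caution about the small number of cases behind each item-criterion correlation.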

That the functions measured by available intelligence tests change
systematically from infancy to school age was demonstrated in a factorial
analysis of the IQ's obtained in the Berkeley Growth Study (28). The basic data
for this analysis were the intercorrelations among IQ's obtained by the same
children on successive retests between the ages of 1 month and 18 years.
During the first year, the children were examined monthly with the California
First-Year Mental Scale; later retests at increasingly longer intervals
employed other appropriate tests, such as the Stanford-Binet and the
Terman-McNemar Group Test. Factorial analysis of the intercorrelations revealed
three factors. The first, identified as sensorimotor alertness, was prominent
during the first two years of life. The second factor reached its peak between
the ages of 2 and 4, its weight being negative in early IQ's and dropping
gradually to zero in later childhood. This factor was tentatively described as
persistence, or a tendency to act in accordance with an established set, as
opposed to responsiveness to momentary stimulation. It was suggested that this
factor may also underlie the stubbornness and negativism often exhibited by
children at these ages. The third factor, characterized as abstraction and the
manipulation of symbols, begins to emerge at about 2 years of age and
becomes the principal factor in the IQ from age 4 on.

A final word may be added regarding the construction of both preschool
and infant tests. In one respect, such tests present the same problem as that
encountered in cross-cultural testing. Unlike the school child, the infant and
preschool child have not been exposed to the standardized series of
experiences represented by the school curriculum. In developing tests for
school-age children or for adults who have completed a prescribed amount of
schooling, the test constructor has a large fund of common experiential material
from which he can draw test items. Prior to school entrance, on the other
hand, the individual's experiences are far less standardized, despite certain
broad cultural uniformities in child-rearing practices. Under these conditions,
the development of satisfactory tests is much more difficult.


N 


- Bayley, Nancy. Menta 
- Bayley, Nancy. On the grow 


- Bradway, Katherine P. I 


- Bradway, Katherine P., Thompson,
CHAPTER 12


The Wechsler Scales and Other Clinical Instruments


This chapter is concerned with two intelligence scales prepared by David
Wechsler, one for adults and one for children. Although administered as
individual tests and designed for many of the same uses as the Stanford-Binet,
these scales differ in several important ways from the earlier test. Both are
point scales rather than age scales. All items of a given type are grouped
into subtests and arranged in increasing order of difficulty within each
subtest. In this respect the Wechsler scales follow the pattern established for
group tests. Another characteristic feature of these scales is the inclusion of
verbal and performance subtests, from which separate verbal and
performance IQ's are computed.

Besides their use as measures of general intelligence, the Wechsler scales
have been investigated as a possible aid in psychiatric diagnosis. Beginning
with the observation that brain damage, psychotic deterioration, and
emotional difficulties may affect some intellectual functions more than others,
Wechsler and other clinical psychologists argued that an analysis of the
individual's relative performance on different subtests should reveal specific
psychiatric disorders. The problems and results pertaining to such a profile
or scatter analysis of the Wechsler scales will be considered in a separate
section. In the final section of this chapter, other types of tests designed to
detect intellectual impairment will be briefly examined.


MEASUREMENT OF ADULT INTELLIGENCE 
The first form of the Wechsler scales, known as the Wechsler-Bellevue
Intelligence Scale, was published in 1939. One of the primary objectives in
its preparation was to provide an intelligence test suitable for adults. In
first presenting this scale, Wechsler (70) pointed out that previously
available intelligence tests had been designed primarily for school children and
had been adapted for adult use by adding more difficult items of the same
kinds. The content of such tests was often of little interest to adults. Unless
the test items have a certain minimum of face validity, rapport cannot be
properly established with adult subjects. Many intelligence test items, written
with special reference to the daily activities of the school child, clearly lack
face validity for most adults. As Wechsler expressed it, "Asking the
ordinary housewife to furnish you with a rhyme to the words, 'day,' 'cat,' and
'mill,' or an ex-army sergeant to give you a sentence with the words, 'boy,'
'river,' 'ball,' is not particularly apt to evoke either interest or respect" (70,
p. 17).

The overemphasis on speed in most tests also tends to handicap the older
person. Similarly, Wechsler believed that relatively routine manipulation of
words receives undue weight in such tests. He likewise called attention to the
inapplicability of mental age norms to adults, and pointed out that few adults
had previously been included in the standardization samples for individual
intelligence tests. It was to meet these various objections that the original
Wechsler-Bellevue was developed.

In form and content, this scale was closely similar to the more recent
Wechsler Adult Intelligence Scale (WAIS) which has now supplanted it. A
description of the latter, to be given in the next section, thus serves also to
characterize the Wechsler-Bellevue. The earlier scale, however, had a number
of technical deficiencies which have been largely corrected in the current
form. A proper evaluation of the many published studies that used the
Wechsler-Bellevue requires some familiarity with its special limitations.

The chief weakness of the Wechsler-Bellevue stemmed from the
unrepresentativeness of its normative sample, which was drawn largely from New
York City and its environs. The total number of adults of both sexes included
in this sample was only 1081. The reliability of some of the subtests was
quite low, especially for the proposed profile analysis of subtest scores.
Obsolescent items, meager validity data, and inadequacies of the manual were
among the other deficiencies of this scale. The Wechsler-Bellevue provided
norms down to the age of 10 years, its total standardization sample including
670 children between the ages of 7 and 16. The scale was not well suited for
testing children, however, and was soon replaced at these levels by the
Wechsler Intelligence Scale for Children (WISC), also to be considered in this
chapter.

The amount of interest aroused by the publication of the
Wechsler-Bellevue, as well as the extent of its use in clinical testing and in
research, can be seen in the bibliography of 625 references listed in the Fifth
Mental Measurements Yearbook for this test alone, exclusive of publications on
the WISC and WAIS. Two special surveys appearing in the Psychological
Bulletin covered research with the Wechsler-Bellevue published during the years
1945-1950 (55) and 1950-1955 (30).


WECHSLER ADULT INTELLIGENCE SCALE 


Description. Published in 1955, the WAIS (73) comprises eleven subtests.
Six subtests are grouped into a Verbal Scale and five into a Performance
Scale. These subtests are listed and briefly described below, in the order of
their administration.


VERBAL SCALE 

1. Information: 29 questions covering a wide variety of information that adults
have presumably had an opportunity to acquire in our culture. An effort was
made to avoid specialized or academic knowledge. It might be added that
questions of general information have been used for a long time in informal
psychiatric examinations to establish the individual's intellectual level and
his practical orientation.

2. Comprehension: 14 items, in each of which the subject explains what should
be done under certain circumstances, why certain practices are followed, the
meaning of proverbs, etc. Designed to measure practical judgment and
common sense, this test is similar to the Stanford-Binet Comprehension items,
but its specific content was chosen so as to be more consonant with the
interests and activities of adults.

3. Arithmetic: 14 problems similar to those encountered in elementary school
arithmetic. Each problem is orally presented and is to be solved without the
use of paper and pencil.

4. Similarities: 13 items requiring the subject to say in what way two things are
alike.

5. Digit Span: Orally presented lists of three to nine digits are to be orally
reproduced. In the second part, the subject must reproduce lists of two to
eight digits backwards.

6. Vocabulary: 40 words of increasing difficulty are presented both orally and
visually. The subject is asked what each word means.


PERF 
ERFORMANCE SCALE 


7. Digit Symbol: This i 
dates back to the early Woodwor 


s a version of the familiar code-substitution test which 
th-Wells Association Tests and has often 
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been included in non-language intelligence scales. The key contains 9 
symbols paired with the 9 digits. The subjects score is the number of 
symbols correctly written within 112 minutes. 


. Picture Completion: 21 cards, each containing a picture from which some 


part is missing. Subject must tell what is missing from each picture. 


. Block Design: This test is similar to the previously described Kohs Block 


Design Test (cf. Ch. 10). In the Wechsler adaptation, however, the blocks 
have only red, white, and red-and-white sides. Subject reproduces designs of 
increasing complexity requiring from four to nine cubes. 


10. Picture Arrangement: Each item consists of a set of cards containing
pictures to be rearranged in the proper sequence so as to tell a story. Figure 59
shows one set of cards in the order in which they are presented to the
subject. This set shows the easiest of 8 items making up the test.

Fig. 59. Demonstration Item from the WAIS Picture Arrangement Test. (Reproduced
by permission of The Psychological Corporation.)

11. Object Assembly: Modeled after the Pintner-Paterson Manikin and Feature
Profile, this test includes improved versions of both of these objects, together
with two additional objects to be assembled.


Both speed and accuracy of performance are taken into account in scoring
Arithmetic, Digit Symbol, Block Design, Picture Arrangement, and Object
Assembly.

Norms and Scoring Procedures. The WAIS standardization sample was
chosen with exceptional care to insure its representativeness. The principal
normative sample consisted of 1700 cases, including an equal number of men
and women distributed over seven age levels between 16 and 64 years. Subjects
were selected so as to match as closely as possible the proportions in
the 1950 U.S. census with regard to part of the country, urban-rural
residence, race (white versus non-white), occupational level, and education.
At each age level, one man and one woman from an institution for mental
defectives were also included. Supplementary norms for older persons were
established by testing an "old-age sample" of 475 persons, aged 60 years and
over, in a typical Midwestern city (18).

Raw scores on each WAIS subtest are transmuted into standard scores 
with a mean of 10 and an SD of 3. These scaled scores were derived from a 
reference group of 500 cases which included all persons between the ages of 
20 and 34 in the standardization sample. All subtest scores are thus ex- 
pressed in comparable units. Verbal, Performance, and Full Scale scores are 
found by adding the scaled scores on the six Verbal subtests, the five Per- 
formance subtests, and all eleven subtests, respectively. By reference to ap- 
propriate tables provided in the manual, these three scores can be expressed 
as deviation IQ's with a mean of 100 and an SD of 15. Such IQ's, however,
are found with reference to the individual's own age group. They therefore
show the individual's standing in comparison with persons of his own age
level.
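
The two-step conversion just described can be illustrated with a minimal
sketch. It assumes a simple linear (z-score) transformation; in practice the
conversions are read from the tables in the manual, so the functions below
only approximate the logic. The reference-group and age-group means and SD's
used here are hypothetical values chosen for illustration, not figures from
the WAIS manual.

```python
def linear_standard_score(raw, ref_mean, ref_sd, new_mean, new_sd):
    """Re-express a score on a scale with the desired mean and SD."""
    if ref_sd <= 0:
        raise ValueError("reference SD must be positive")
    z = (raw - ref_mean) / ref_sd          # standing relative to the reference group
    return new_mean + new_sd * z


def scaled_score(raw, ref_mean, ref_sd):
    """Subtest scaled score: mean 10, SD 3 in the reference group."""
    return linear_standard_score(raw, ref_mean, ref_sd, 10.0, 3.0)


def deviation_iq(sum_of_scaled, age_mean, age_sd):
    """Deviation IQ: mean 100, SD 15 within the subject's own age group."""
    return linear_standard_score(sum_of_scaled, age_mean, age_sd, 100.0, 15.0)


# Hypothetical values: a raw score of 48 on a subtest whose reference group
# has mean 40 and SD 8, and a sum of scaled scores of 120 in an age group
# whose sums have mean 110 and SD 25.
print(scaled_score(48, 40, 8))
print(deviation_iq(120, 110, 25))
```

Because the second step uses the mean and SD of the subject's own age group,
the same sum of scaled scores yields a different IQ at different ages.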

As can be seen in Figure 60, in the WAIS standardization sample Full
Scale scores rise until the late twenties and then decline slowly until 60.
A sharper rate of decline was found beyond age 60 in the old-age sample.


Fig. 60. Decline in WAIS Scaled Scores with Age. (Data from Wechsler, 74, p. 95.)

By deriving IQ's separately for each age level, individuals are thus
compared with a declining norm beyond the peak age. The age decrement is
greater in Performance than in Verbal scores and also varies from one sub- 
test to another (18). Thus Digit Symbol, with its heavy dependence on speed 
and visual perception, shows the maximum age decline. In the other Per- 
formance subtests, however, speed may not be an important factor in the 
observed decline. In a special study of this point, subjects in the old-age sam- 
ple were given these tests under both timed and untimed conditions. Not only 
were the differences in scores under these two conditions slight, but the dec- 
rements from the 60-64 to the 70-74 age group were virtually the same 
under timed and untimed conditions (74, p. 137). 

In this connection it should be noted that in the WAIS standardization
sample both the peak scores and the onset of decline occur at older ages than
in the Wechsler-Bellevue standardization sample examined some 15 years
earlier. These differences could result from a number of factors (74,
pp. 140-141). One possible explanation of both the difference between the two
standardization samples and the age decrement itself is provided by amount
of education received by subjects in different age groups. Because of the
rising educational level of the general population, older groups at any one
time have received less education on the average than younger groups. Such
a difference, of course, is reflected in WAIS and Wechsler-Bellevue
standardization samples, and should be if these samples are to be representative
of the general U.S. population (1). Moreover, members of the WAIS
standardization sample will have received more education on the average than
persons of corresponding age in the Wechsler-Bellevue sample, since the
lat[...] [...]amination of educational data [...]e differences, [...]r of
longitudinal investigations [...] 242-243). These studies showed that
intellectually superior groups that continue the[...]

Reliability. For each of the eleven subtests, as well as for Verbal,
Performance, and Full Scale IQ's, reliability coefficients were computed within
the 18-19, 25-34, and 45-54 year samples. These three groups were chosen as
being representative of the age range covered by the standardization sample.
Odd-even reliability coefficients (corrected for length by the Spearman-Brown
formula) were employed for every subtest except Digit Span and
Digit Symbol. The reliability of Digit Span was estimated from the correlation 
between Digits Forward and Digits Backward scores. No split-half technique 
could be utilized with Digit Symbol, which is a highly speeded test. The reli- 
ability of this test was therefore determined by parallel-form procedures in a 
group specially tested with WAIS and Wechsler-Bellevue Digit Symbol sub- 
tests. 
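
The odd-even procedure with the Spearman-Brown correction can be sketched as
follows. This is an illustrative outline only: the item-score layout (one
list of item scores per subject) and the function names are assumptions
introduced here, and the sample data are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length, n >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary")
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half, factor=2.0):
    """Estimate reliability of a test `factor` times as long as the part scored."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)


def odd_even_reliability(item_scores):
    """Correlate odd-item and even-item half scores, then correct for length."""
    odd = [sum(person[0::2]) for person in item_scores]    # items 1, 3, 5, ...
    even = [sum(person[1::2]) for person in item_scores]   # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))


# Invented data: four subjects, six items scored 0 or 1.
data = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
print(odd_even_reliability(data))
```

A half-test correlation of .50, for example, corresponds to an estimated
full-length reliability of about .67, which is why the uncorrected
correlation between halves understates the reliability of the whole test.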

Full Scale IQ's yielded reliability coefficients of .97 in all three age sam- 
ples. Verbal IQ's had identical reliabilities of .96 in the three groups, and 
Performance IQ's had reliabilities of .93 and .94. All three IQ’s are thus 
highly reliable in terms of coefficients of equivalence. As might be expected, 
the individual subtests yield lower reliabilities, ranging from a few coeffi- 
cients in the .60's found with Digit Span, Picture Arrangement, and Object 
Assembly, to coefficients as high as .96 for Vocabulary. It is particularly im- 
portant to consider these subtest reliabilities when evaluating the significance 
of differences between subtest scores obtained by the same individual, as in
profile analysis. In general, the WAIS subtest reliabilities are higher than
those of the Wechsler-Bellevue, since several tests were lengthened and more
ceiling was added by the insertion of new items.

The WAIS manual also reports standard errors of measurement for the
three IQ's and for subtest scores. For Verbal IQ, such errors were 3 points
in each group, for Performance IQ, just under 4 points, and for Full Scale
IQ, 2.60. We could thus conclude, for example, that the chances are roughly
2:1 that an individual's true Verbal IQ falls within 3 points of his obtained
Verbal IQ. The above values compare favorably with the 5-point error of
measurement found for the Stanford-Binet (Ch. 8). It should be remembered,
however, that the Stanford-Binet reliabilities were based on parallel forms
administered over intervals of one week or less; under such conditions we
would anticipate somewhat lower reliability coefficients and greater
fluctuation of scores.
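
The reported errors of measurement are consistent with the conventional
formula, standard error = SD × √(1 − reliability), applied to an IQ SD of 15
and the reliability coefficients cited above. The manual's own computation is
not described here, so the following is offered only as a check under that
assumption, together with the normal-curve basis for the "2:1" statement.

```python
from math import erf, sqrt


def standard_error_of_measurement(sd, reliability):
    """Conventional formula: SD times the square root of (1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1.0 - reliability)


def odds_within_one_sem():
    """Odds that a normally distributed error falls within +/- 1 standard error."""
    p_inside = erf(1.0 / sqrt(2.0))        # about .68 of the normal curve
    return p_inside / (1.0 - p_inside)


for label, r in [("Verbal IQ", 0.96), ("Performance IQ", 0.93), ("Full Scale IQ", 0.97)]:
    print(label, round(standard_error_of_measurement(15.0, r), 2))
print("odds within one standard error:", round(odds_within_one_sem(), 2))
```

With these inputs the Verbal value comes to 3 points, the Performance value
to just under 4, and the Full Scale value to about 2.60, in agreement with
the figures quoted from the manual.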

Validity. Any discussion of validity of the WAIS must draw upon research
done with the earlier Wechsler-Bellevue as well. Because it has been available
much longer, the Wechsler-Bellevue has been used in many more investigations
than the WAIS. Since all changes introduced in the WAIS represent
improvements over the Wechsler-Bellevue (in reliability, ceiling, normative
sample, etc.) and since the nature of the test has remained substantially the
same, it is reasonable to suppose that validity data obtained on the
Wechsler-Bellevue will underestimate rather than overestimate the validity of the
WAIS.

The WAIS manual itself contains no validity data, but several aspects of
validity are covered in a subsequent book by Wechsler (74). Chapter 5 of 
that book is devoted to a discussion of the content validity of the Wechsler 
scales. Wechsler argues that the psychological functions tapped by each of the 
eleven chosen subtests fit the definition of intelligence, that similar tests have 
been successfully employed in previously developed intelligence scales, and 
that such tests have proved their worth in clinical experience. The test author 
himself places the major emphasis on this approach to validity. 

Some empirical data on concurrent validity are summarized by Wechsler
in a chapter on the use of the scales in counseling and guidance (74, Ch. 14).
These data include mean IQ differences among various educational and
occupational groups, as well as a few correlations with job performance ratings
and academic grades. Most group differences, though small, are in the
expected directions. Persons in white-collar jobs of different kinds and levels
averaged higher in Verbal than in Performance IQ, but skilled workers
averaged higher in Performance than in Verbal. Verbal IQ correlated in the .30's
with over-all performance ratings in studies of industrial executives and
psychiatric residents. Both groups, of course, were already selected in terms
of the abilities measured by these tests. Correlations in the .40's and .50's
were found between Verbal IQ and college [...] with the Verbal Scale,
however, were not appreciably higher than those obtained with the
Stanford-Binet and with well-known group tests.

Of some relevance to the construct validity of the Wechsler scales are the
intercorrelations of subtests and of Verbal and Performance IQ's, as well as
factorial analyses of the scales. In the process of standardizing the WAIS,
intercorrelations of Verbal and Performance Scales and of the eleven subtests
were computed on the same three age groups on which reliability coefficients
had been found, namely, 18-19, 25-34, and 45-54. Verbal and Performance Scale
scores correlated [...], [...], and .81, respectively, in these three groups.
Intercorrelations of separate subtests were [...] in the three age groups,
running higher among Verbal than among Performance subtests. Correlations
[...] Performance subtests, although still [...]. For example, in the 25-34
year group, correlations among Verbal subtests ranged from [...], among
Performance subtests from .44 to .62, and between Performance and Verbal
subtests [...]. [...] correlations between total Verbal and Performance Scale
scores suggest that the two scales have much in common and that the allocation
of tests to one or the other scale may be somewhat arbitrary.

Factorial analyses of the Wechsler scales have been conducted with a variety
of subjects ranging from eighth-grade pupils to the old-age standardization
sample (aged 60-75 +) and including both normal and abnormal groups.
They have also employed different statistical procedures and have approached
the analysis from different points of view. Some have been directly
concerned with age changes in the factorial organization of the Wechsler sub- 
tests, but the findings of different investigators are inconsistent in this regard 
(6, 14). Surveys of the specific results of this research and references to the 
original sources may be found in Wechsler (74, Ch. 8), Cohen (14), and 
Saunders (59). As an example, we may examine the factorial analyses of the 
WAIS conducted by Cohen (14, 15) with the intercorrelations of subtests ob- 
tained on four age groups in the standardization sample (18-19, 25-34, 45-54, 
and 60-75 +). The major results of this study are in line with those of other 
investigations using comparable procedures. 

That all eleven subtests have much in common was demonstrated in Co- 
hen's study by the presence of a single general factor that accounted for about 
50 per cent of the total variance of the battery. In addition, three major 
group factors were identified. One was a verbal comprehension factor, with 
large weights in the Vocabulary, Information, Comprehension, and Similarities
subtests. A perceptual organization factor was found chiefly in Block
Design and Object Assembly. This factor may actually represent a combination
of the perceptual speed and spatial visualization factors repeatedly found
in factorial analyses of aptitude tests. The results of an earlier investigation 
by Davis (16), in which “reference tests" measuring various factors were 
included with the Wechsler subtests, support this composite interpretation of 
the perceptual organization factor. 
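
What it means for a general factor to account for "about 50 per cent of the
total variance" can be shown with a small sketch. It assumes standardized
subtests (unit variance each), so that the share of total variance attributable
to a factor is the mean of its squared loadings. The loadings below are
hypothetical round numbers chosen for illustration; they are not Cohen's
results.

```python
def proportion_of_variance(loadings):
    """Share of total variance in a standardized battery due to one factor."""
    if not loadings:
        raise ValueError("at least one loading is required")
    return sum(a * a for a in loadings) / len(loadings)


# Hypothetical general-factor loadings for eleven subtests.
hypothetical_g_loadings = [0.8, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.6, 0.6]
print(proportion_of_variance(hypothetical_g_loadings))
```

Loadings averaging near .70 thus correspond to roughly half of the battery's
variance, leaving the remainder to group factors, specific variance, and
error.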

The third major group factor identified by Cohen was described as a mem- 
ory factor. Found principally in Arithmetic and Digit Span, it apparently in- 
cludes both immediate rote memory for new material and recall of previously 
learned material. It is suggested that ability to concentrate and to resist dis- 
traction may be involved in this factor. Of special interest is the finding that 
the memory factor increased sharply in prominence in the old-age sample. 
At that age level it had significant loadings, not only in Arithmetic and 
Digit Span, but also in Vocabulary, Information, Comprehension, and Digit 
Symbol. Cohen points out that during senescence memory begins to deteriorate
at different ages and rates in different persons. Individual differences
in memory thus come to play a more prominent part in intellectual functioning
than had been true at earlier ages. Many of the WAIS subtests require
memory at all ages. Until differential deterioration sets in, however,
individual differences in the retentive ability required in most of the
subtests are insignificant.

It should be noted that the results of Cohen’s study fail to support the 
standard practice of grouping tests into Verbal and Performance Scales, each 
yielding a separate IQ (15). Although the use of a Full Scale IQ is justified 
by the large general factor content of all subtests, the verbal comprehension 
factor occurs in only four of the six Verbal Scale subtests. The memory fac- 
tor is found in the two remaining Verbal subtests, as well as in other sub- 
tests from both Scales in the case of older subjects. And the perceptual or- 
ganization factor has significant loadings in only two of the five Performance 
Scale subtests. The remaining Performance subtests seem to have largely 
specific variance, not shared with other subtests in this battery. 

Working with normal samples and using item intercorrelations and other 
procedural variations, Saunders (59, 61, 62, 63) found evidence of at least 
10 identifiable factors in WAIS performance. There was not, however, a one- 
to-one correspondence between these factors and the WAIS subtests. Several 
subtests proved to be factorially complex, and certain factors cut across more 
than one subtest. Of particular interest is some suggestive evidence for a
deterioration factor common to Wechsler's proposed "[...]" [...] be
discussed later in this chapter (63).

Comparison with Other Tests. The Wechsler scales have been repeatedly
correlated with the Stanford-Binet as well as with other well-known tests of
intelligence. Several summaries of such correlations have been published
(cf. 30; 55; 74, Ch. 7). Correlations with the Stanford-Binet in unselected
adolescent or adult groups generally fall in the .80's and .90's. Within more
homogeneous samples, such as college students (3, 58), the correlations tend
to be considerably lower. Group tests yield somewhat lower correlations
with the Wechsler scales, such correlations ranging from about .40 to about
.80. For both Stanford-Binet and group scales, correlations are nearly always
[...] Full Scale, while correlations with the Performance Scale are much
lower than either. [...] Minnesota Paper Form Board Test in a group of
16-year-old boys and girls (36), [...] IQ's correlated .70 with Raven's
Pro[...]

It has been repeatedly found that brighter subjects tend to score higher on
the Stanford-Binet than on the Wechsler scales, while duller subjects score 
higher on the Wechsler than on the Stanford-Binet. For example, college 
freshmen obtained significantly higher mean IQ's on the Stanford-Binet than 
on the Wechsler Full Scale (3, 58). The same relationship holds if compari- 
sons are made between Stanford-Binet and either Verbal or Performance 
IQ's. Mental defectives, on the other hand, receive higher IQ's on the
Wechsler. Within any group, as the IQ's diverge from 100 in either direction,
the differences between the two scales become more pronounced (11, 41,
50). As the extremes of the range are approached, IQ's on the two scales may
differ by as much as 20 points (11). Although these discrepancies are
smaller on the WAIS than on the earlier Wechsler-Bellevue, significant IQ
differences still remain (47).

To some extent, the difference in standard deviation of Wechsler and
Stanford-Binet IQ's may account for the differences between the IQ's
obtained with the two scales. It will be recalled that the SD of the
Stanford-Binet IQ is 16 (actually fluctuating around this value in the 1937 form),
while that of the Wechsler IQ is 15. The discrepancies in individual IQ's,
however, are larger than would be expected on the basis of such a difference.
Another difference between the two scales is that the Wechsler has less floor
and ceiling than the Stanford-Binet and hence does not discriminate as well
at the extremes of the IQ range.
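
How little of the discrepancy the difference in SD's can explain is easily
verified. The sketch below assumes that both IQ's are normally scaled around
a mean of 100 and that a person holds the same relative standing (the same
standard score) on both scales; the function name is introduced here for
illustration.

```python
def convert_iq(iq, from_sd, to_sd, mean=100.0):
    """Re-express an IQ on a scale with a different SD, holding standing constant."""
    if from_sd <= 0 or to_sd <= 0:
        raise ValueError("SD's must be positive")
    z = (iq - mean) / from_sd
    return mean + z * to_sd


# Stanford-Binet IQ's (SD 16) re-expressed on the Wechsler scale (SD 15).
for binet_iq in (148, 132, 100, 68, 52):
    wechsler_equivalent = convert_iq(binet_iq, 16.0, 15.0)
    print(binet_iq, wechsler_equivalent, binet_iq - wechsler_equivalent)
```

Even three SD's from the mean, the difference in scale units amounts to only
3 IQ points, far short of the discrepancies of as much as 20 points reported
at the extremes.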

The relationship between Stanford-Binet and Wechsler IQ's depends not
only upon IQ level, but also upon age. Other things being equal, older
subjects tend to obtain higher IQ's on the Wechsler than on the Stanford-Binet,
while the reverse is true of younger subjects. One explanation for such a
trend is obviously provided by the use of a declining standard in the
computation of the Wechsler IQ's of older persons. On the Stanford-Binet, on the
other hand, all adults are evaluated in terms of the average peak age on that
scale, viz., 18 years (16 years on the earlier form). It is also possible that,
since the Stanford-Binet was standardized primarily on children and the
Wechsler on adults, the content of the former tends to favor children while
that of the latter favors older persons. It will be recalled that in the
construction of the original Wechsler-Bellevue a special effort was made to choose
material appropriate for adults.


Abbreviated Scales. Since the publication of the original Wechsler-Bellevue
Scale, a large number of abbreviated scales have been proposed. These scales
are formed simply by omitting some of the subtests and prorating scores to
obtain a Full Scale IQ comparable to the published norms. The fact that
several subtest combinations, while effecting considerable saving in time,
correlate over .90 with Full Scale IQ's has encouraged the development and use of
abbreviated scales for rapid screening purposes. It is possible that the most
effective combination of subtests varies for different intellectual levels or other
special populations (55).
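
Prorating, as the term is used above, can be sketched in a few lines. The
example assumes simple proportional proration, in which the sum of scaled
scores on the subtests actually given is multiplied by the ratio of the full
number of subtests to the number given; the prorated sum would then be
entered in the regular IQ tables of the manual. The scaled scores shown are
invented.

```python
def prorate_sum(scaled_scores, n_full=11):
    """Estimate the full-battery sum of scaled scores from a subset of subtests."""
    k = len(scaled_scores)
    if k == 0 or k > n_full:
        raise ValueError("number of subtests given must be between 1 and n_full")
    return sum(scaled_scores) * n_full / k


# Invented scaled scores on a four-test abbreviated scale.
four_test_scores = [12, 11, 13, 10]
print(prorate_sum(four_test_scores))
```

The procedure treats the omitted subtests as if the subject had scored on
them at the average level of the subtests given, which is one reason the
published Full Scale norms may not apply exactly to prorated totals.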


[...] the best combinati[...] Vocabulary, and Block Design; for four tests,
Information, Vocabulary, Block Design, and Picture Arrangement; and for five
tests, Information, Similarities, Vocabulary, Block Design, and Picture
Arrangement. Other combinations were as good or nearly [...] of a single
four-test combination (one of [...] the abbreviated scales. Moreover, the
assumption that the original Full Scale norms are applicable to prorated total
[...] may not always be justified.


General Evaluation. The WAIS is unquestionably an improvement over its
predecessor, the Wechsler-Bellevue. The care with which a representative
nationwide normative sample was assembled is a particularly [...] feature of
the WAIS standardization. The increase in length and difficulty range of
subtests has raised reliabilities, although some subtest reliabilities are
still too low for the type of inter[...] availability of well-establish[...]
special contribution of the [...]tional and cultural level of [...]quent
rechecking, [...] WAIS IQ's vary System[...]

More empirical data on WAIS validity would be desirable. It might be
noted that nearly all validity data have so far been gathered somewhat in- 
cidentally by investigators not directly concerned with the development or 
distribution of this scale. More systematic investigation of validity would 
strengthen the interpretation of test scores. Factorial analyses support the use 
of Full Scale IQ's, because of the large general factor in the test scores. But
the standard subdivision into Verbal and Performance IQ's is not well
substantiated by the results of such research. Abbreviated scales, though
plentiful and correlating highly with Full Scale IQ's, should be used sparingly
because they sacrifice much of the qualitative information that an individual
clinical instrument should provide.


WECHSLER INTELLIGENCE SCALE FOR CHILDREN 


Description. The Wechsler Intelligence Scale for Children (WISC) was 
prepared as a downward extension of the Wechsler-Bellevue (65, 71). Many 
items were taken from the Wechsler-Bellevue, easier items of the same types 
being added to each test. The WISC consists of twelve subtests, of which two 
are to be used either as alternates or as supplementary tests if time permits.
As in the other Wechsler scales, the subtests are grouped into a Verbal and a
Performance Scale, as follows:


VERBAL SCALE                   PERFORMANCE SCALE

1. General Information          6. Picture Completion
2. General Comprehension        7. Picture Arrangement
3. Arithmetic                   8. Block Design
4. Similarities                 9. Object Assembly
5. Vocabulary                  10. Coding (or Mazes)
   (Digit Span)


The tests listed as alternates were those giving the lowest correlations with
the rest of the scale. In the Verbal Scale, Digit Span proved to be the least
satisfactory test and was therefore designated as alternate. In the Performance
Scale, either Coding or Mazes may be omitted, the decision being left to the
examiner. Coding requires less time than Mazes, however, and may be
generally preferred for this reason. The Coding Test corresponds to the Digit
Symbol Test of the adult scale, with an easier part added. The only subtest that
does not appear in the adult scale is the Mazes. This test consists of eight
paper-and-pencil mazes of increasing difficulty, performance being scored in
terms of both time and errors. If all twelve tests are administered, the total
scores must be prorated before the IQ is computed. Figure 61 shows a child
working on one of the easier items of the Object Assembly Test.

Fig. 61. The Object Assembly Test of the Wechsler Intelligence Scale for Children.
(Courtesy The Psychological Corporation.)


Norms and Scoring Procedures. The treatment of scores on the WISC
follows the procedures used in the adult scale, with minor differences. Raw
scores on each subtest are first transmuted into normalized standard scores
within the subject's own age group. Tables of such scaled scores are provided
for every 4-month interval between the ages of 5 and 15 years. As in the
adult scales, the subtest scaled scores are expressed in terms of a distribution
with a mean of 10 and an SD of 3 points. The scaled subtest scores are
added and converted into a deviation IQ with a mean of 100 and an SD
of 15. Verbal, Performance, and Full Scale [...] method. Wechsler (72)
subsequently [...] age equivalents of WISC scores. Although [...]

[...] WISC included 100 boys and 100 girls
at each age from 5 through 15 years, giving a total of 2200 cases. Each
child was tested within 1½ months of his midyear. For example, the
5-year-olds ranged in age from 5-years-4-months-15-days to
5-years-7-months-15-days. Only white children were included. All subjects were
obtained in schools, with the exception of 55 mental defectives, who were tested
in insti[...] [...]cated in 11 states, as well as in 3 institutions for mental
defectives. The distribution of subjects conformed closely to the 1940 U.S.
census for the country
at large, in terms of geographical area, urban-rural proportion, and parental 
occupation. In many respects, the WISC standardization sample is more 
representative of the country at large than any other sample employed in 
standardizing individual tests. 

Reliability. Split-half reliability coefficients are reported for each subtest of
the WISC, as well as for Verbal, Performance, and Full Scale scores. These
reliabilities were computed separately within the 7½-, 10½-, and 13½-year
samples, each age group consisting of 200 cases. Since the odd-even
technique was inapplicable to Coding and Digit Span, scores on two parts of
these tests were correlated. Owing to the lack of complete comparability of
the two parts, however, the coefficients obtained for these two tests probably
underestimate their reliabilities. The Full Scale reliability coefficients for the
three age levels were .92, .95, and .94, respectively. The corresponding
reliabilities for the Verbal Scale were .88, .96, and .96; for the Performance
Scale, they were .86, .89, and .90. Thus both the Full Scale and the Verbal
and Performance IQ's appear to be sufficiently reliable for most testing
purposes.

A different picture is presented by the subtest reliabilities. A few of these
coefficients are in the .50's. Most are evenly distributed in the .60's, .70's,
and .80's. Only one test, Vocabulary, yielded any coefficients in the .90's;
and even this test had a reliability of only .77 in the 7½-year group. It
might be added that most of the subtests had lower reliability coefficients
in the youngest age group than in the other two groups. The test manual
rightly cautions the users of this scale against interpreting differences between
subtest scores without due reference to the reliability coefficients of the
particular subtests.

A four-year follow-up indicated that WISC IQ's are about as stable as
Stanford-Binet IQ's over such an interval (25). When 60 fifth-grade pupils
were retested in the ninth grade, their Stanford-Binet IQ's correlated .78. On
the WISC, the Full Scale, Verbal, and Performance IQ's correlated .77, .77,
[...]

Validity. No [...] of validity is included in the WISC manual. To be
sure, the normative tables of standard score equivalents for each subtest
provide evidence of age differentiation, but no evaluation of the data in terms
of this criterion is given. It is also relevant to observe that, of the 55
institutionalized mental defectives tested in the standardization sample, only 4
obtained Full Scale IQ's above 70, the mean IQ of this group being 57 (65,
p. 109). A few independent investigators have found fairly high concurrent
validity coefficients between WISC scores and achievement tests or other
academic criteria of intelligence (cf. 43). As would be expected, the Verbal
Scale tended to correlate higher than the Performance Scale with such
criteria.

The WISC manual reports intercorrelations among the individual subtests,
as well as the correlation of each subtest with Verbal, Performance, and
Full Scale scores, and of these three composite scores with each other. All
correlations are given separately for the 200 cases at each of three age levels
in the standardization sample, viz., 7½, 10½, and 13½ years. The
correlations between total Verbal and Performance scores were .60, .68, and .56,
respectively, in these three age groups. Thus the two parts of the scale have
much in common, although the correlations between them are low enough to
justify the retention of both parts.

An analysis of occu[...]


TABLE 18. Mean IQ's on the Wechsler Intelligence Scale for Children in Relation
to Paternal Occupation
(Adapted from Seashore, Wesman, and Doppelt, 65, pp. 107-109)

                                                              Mean IQ
Occupational Category                             Verbal   Performance   Full Scale
Professional and semiprofessional workers         110.9      107.8         110.3
Proprietors, managers, and officials              105.9      105.3         106.2
Clerical, sales, and kindred workers              105.2      104.3         105.2
Craftsmen, foremen, and kindred workers           100.8      101.6         101.3
Operatives and kindred workers                     98.9       99.5          99.1
Domestic, protective, and other service workers    97.6       96.9          97.0
Farmers and farm managers                          96.8    [illegible]      97.4
Farm laborers and foremen, laborers                94.6    [illegible]      94.2

[...] that these class differences decline with age, possibly because of
[...]ing (19, 20).

Attention is also called to the relatively poor performance of the
predominantly rural groups. This finding was supported by a separate analysis
of scores with reference to urban and rural residence, and reflects the almost 
universal result obtained with intelligence tests standardized on a
predominantly urban sampling.¹ It should likewise be noted that both urban-rural
and occupational differences tend to be somewhat smaller on the Performance
than on the Verbal Scale. This, too, is a typical finding.

An analysis of the discrepancies between Verbal and Performance scores 
for each individual throws further light upon these group comparisons (64). 
For the entire standardization sample, the mean difference between Verbal 
and Performance IQ was, of course, zero. This follows from the procedure 
employed in computing deviation IQ’s on the WISC. Each age group like- 
wise yielded mean Verbal-Performance differences of practically zero. About 
half of the individual cases showed Verbal-Performance discrepancies of 8 
points or more. In a breakdown with reference to occupational categories, 
all group differences were small and statistically insignificant, with one excep- 
tion. The professional and semiprofessional category contains a significantly 
greater proportion of children with higher Verbal than Performance IQ's.
In this group, 62 per cent had a positive V-P difference, 35 per cent a negative
difference, and 3 per cent had identical IQ's on both scales. In all other
categories, the proportions of positive and negative discrepancies were ap- 
proximately equal. The differences that did occur, however, were in the 
expected direction, viz., a greater tendency for rural children and children 
from lower socioeconomic levels to obtain higher scores on the Performance 
than on the Verbal Scale. 

Comparison with Other Tests. Comparisons of WISC and Stanford-Binet
IQ's have yielded results very similar to those obtained with the adult
scales. Such comparisons have utilized a variety of groups, including both
preschool and school-age children and ranging from mental defectives to
gifted children (24, 33, 40, 43, 51, 54, 68). Correlations between WISC
and Stanford-Binet range from the .60's to the .90's, varying with the age,
intellectual level, and heterogeneity of the samples. The Verbal Scale again
correlates more highly with the Stanford-Binet than does the Performance
Scale. With such a test as the Arthur Performance Scale, on the other hand,
the reverse is true, the WISC Performance IQ yielding a higher correlation
than the WISC Verbal IQ (54). As in the adult scales, normal and superior
children tend to score higher on Stanford-Binet than on WISC. The
discrepancy in favor of the Binet is greater for brighter and younger subjects.
For the mentally retarded, the WISC yields a significantly higher mean IQ
than the Binet.

¹ For a further discussion of typical findings and of their implications for test
construction, cf. Anastasi (2, pp. 525-533).

General Evaluation. On the whole, the WISC compares favorably with the
more recently developed WAIS in the quality of its test-construction
procedures. The size and representativeness of its normative sample and the
careful procedures followed in determining reliability set a particularly high
standard in test development. The dearth of validity data remains its
principal weakness. More attention should also be given to the consistent
discrepancies between WISC and Stanford-Binet IQ's at different ages and
intellectual levels. Finally, there seems to be something of a paradox in the
underlying rationale of the WISC. It will be recalled that a major reason
for the development of the original Wechsler-Bellevue was the need for an
adult intelligence test that would not be a mere upward extension of available
children's scales. Having presumably achieved this objective, the author
then proceeded to prepare a children's scale that was simply a downward
extension of the adult scale. Is this a case of "Heads I win and Tails you
lose"?

PROFILE ANALYSIS WITH THE WECHSLER SCALES 


[...] for diagnosing intellectual impairment or deterioration resulting from
brain damage, psychotic disorders, [...]. In this connection, a distinction
must be made between ment[...], on the one hand, and mental deterioration,
[...]. Similarly, it is [...]al disturbances will interfere [...] which require
careful observation and concentration, while leaving performance on other
tests unimpaired.
Techniques of Profile Analysis. Wechsler first discussed the diagnostic use
of his scales in the second edition of the Wechsler-Bellevue manual,
published in 1941. The 1958 edition of this book, dealing with the WAIS (74),
contains a revised and expanded treatment of such a diagnostic use of the
scales. Another similar system for clinical interpretation of Wechsler scores
was proposed by Rapaport (56). Still other clinicians have recommended
other techniques and modifications (cf. 30, 55). All these techniques are
based essentially on the individual's relative performance on different
subtests. The fact that raw scores on all Wechsler subtests are transmuted into
standard scores permits direct comparisons among them and has undoubtedly
encouraged the development of an overabundance of diagnostic indices.
Specifically, these indices utilize any one of three procedures: measuring
amount of scatter, analyzing score patterns, and computing a deterioration
index.

Scatter is simply the extent of variation among the individual's scores on
the eleven subtests. Wechsler (74, p. 162) proposes that it be measured by
finding the average deviation (AD) of the eleven scores around the
subject's own mean. The underlying rationale of scatter indices implies that the
AD should be larger in pathological than in normal cases. Wechsler illustrates
this hypothesis with data on small matched groups of schizophrenics and
normals, the former showing significantly greater AD's.

Both Wechsler (74, Ch. 11) and Rapaport (56) have described what
they consider characteristic score patterns for various clinical syndromes.
Wechsler provides such patterns for organic brain disorders, schizophrenia,
anxiety states, juvenile delinquency, and mental deficiency. Each pattern is
expressed in terms of the position of each subtest with reference to the
individual's mean on all eleven subtests. These patterns are supplemented with
a number of special diagnostic signs associated with each syndrome. For
example, some of the characteristic signs of schizophrenia (74, p. 171)
listed by Wechsler are:

    Sum of Picture Arrangement plus Comprehension less than Information
        plus Block Design
    Object Assembly much below Block Design
    Very low Similarities with high Vocabulary and Information

Obviously, a question that must be answered prior to the application of
such a system of score patterns and diagnostic signs pertains to the minimum
score difference required for statistical significance. With the reliability
coefficients obtained in the standardization sample, it is possible to compute
for every pair of WAIS subtests the smallest difference that would be
significant at any desired probability level. A table giving these differences at the
15 per cent level is reproduced by Wechsler (74, p. 164). The minimum
values found in this table for different test pairs vary from 2 to 4 scaled
score points, most comparisons requiring 3 points for a significant difference.
It should be noted, however, that this level of significance permits a much
greater probability of error (15 per cent) than the customary 5 per cent or
1 per cent levels.

Another type of intertest comparison proposed by Wechsler requires the
computation of a deterioration index (74, Ch. 12). This index was suggested
by the observation that in the standardization sample the amount of age
decrement varied with the subtest. Tests requiring the utilization of past
learning showed less decline than those involving speed, new learning, and
the perception of new relations in verbal or spatial content. On the basis of
such findings, Wechsler selected a set of "Hold" tests exhibiting little or no
age decline and a set of "Don't Hold" tests manifesting relatively steep
decline. These tests are listed below:

    "HOLD" TESTS            "DON'T HOLD" TESTS
    Vocabulary              Digit Span
    Information             Similarities
    Object Assembly         Digit Symbol
    Picture Completion      Block Design

The deterioration index is found by subtracting the sum of the four Don't
Hold tests from the sum of the four Hold tests, and dividing this difference
by the sum of the Hold tests, as shown below:

    (Hold - Don't Hold) / Hold

[...] the same differential loss on WAIS [...] with advancing age. To allow
for [...] normative tables for each age level. Hence [...] each subtest is
compared with that of his age peers.
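
The two indices described above, the average deviation (AD) used as a measure of scatter and the deterioration index, involve only simple arithmetic. The following is a minimal sketch, not Wechsler's own worked example. The scaled scores are hypothetical, the function names are invented for illustration, and the Hold and Don't Hold lists follow the table above as reconstructed from a damaged page.

```python
# Hypothetical WAIS scaled scores for one person (illustrative values only).
profile = {
    "Information": 11, "Comprehension": 10, "Arithmetic": 10,
    "Similarities": 10, "Digit Span": 9, "Vocabulary": 12,
    "Digit Symbol": 8, "Picture Completion": 11, "Block Design": 9,
    "Picture Arrangement": 10, "Object Assembly": 10,
}

# Subtest groupings taken from the Hold / Don't Hold table in the text.
HOLD = ("Vocabulary", "Information", "Object Assembly", "Picture Completion")
DONT_HOLD = ("Digit Span", "Similarities", "Digit Symbol", "Block Design")


def average_deviation(scores):
    """Mean absolute deviation of the scores around the person's own mean."""
    scores = list(scores)
    mean = sum(scores) / len(scores)
    return sum(abs(s - mean) for s in scores) / len(scores)


def deterioration_index(scaled):
    """(Hold - Don't Hold) / Hold, using sums of scaled scores."""
    hold = sum(scaled[name] for name in HOLD)
    dont_hold = sum(scaled[name] for name in DONT_HOLD)
    return (hold - dont_hold) / hold


scatter = average_deviation(profile.values())
index = deterioration_index(profile)
print(f"AD = {scatter:.2f}, deterioration index = {index:.2f}")
```

For this invented profile the eleven scores average 10, the absolute deviations sum to 8, and the Hold and Don't Hold sums are 44 and 36, so that the AD is 8/11 and the index is 8/44.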

Critical Evaluation of Profile Analysis. The diagnostic interpretation of
Wechsler profiles by any of the above procedures has been criticized
from a number of angles (30, 44, 46, 55). The reliabilities of the subtests,
although higher in the WAIS than in the earlier Wechsler-Bellevue, are still
not high enough to permit confident interpretation of any but the largest
differences. For instance, to be significant at the .01 level, the difference
between Arithmetic and Comprehension must be at least 5 points (46). A
related question concerns the frequency with which intertest differences of
this magnitude occur in the normal population. Such frequencies can readily
be determined for the standardization sample on the basis of the
intercorrelations of subtests. Thus McNemar (46) has shown that in the above
Arithmetic-Comprehension comparison, 10 per cent of the standardization
sample yielded differences of 5 points or more. If we substitute the .15 level
of significance proposed by Wechsler for the .01 level, the percentage of
differences above the minimum value in the normal population would be far
greater.
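
The dependence of the minimum significant difference on subtest reliability and on the chosen probability level can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of Wechsler's or McNemar's computations: it uses the conventional standard error of a difference between two scores on the same scale, SD x sqrt(2 - r1 - r2), a scaled-score SD of 3, and two hypothetical reliability coefficients (.80 and .78) that are not taken from the WAIS manual.

```python
from math import sqrt
from statistics import NormalDist

SCALED_SD = 3.0  # assumed SD of Wechsler subtest scaled scores


def min_significant_difference(r_a, r_b, alpha, sd=SCALED_SD):
    """Smallest score difference significant at the two-tailed level alpha.

    r_a and r_b are the reliability coefficients of the two subtests.
    """
    se_diff = sd * sqrt(2.0 - r_a - r_b)         # standard error of the difference
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-tailed critical value
    return z * se_diff


# Hypothetical reliabilities for a pair of subtests (not manual values).
r_first, r_second = 0.80, 0.78

d15 = min_significant_difference(r_first, r_second, alpha=0.15)
d05 = min_significant_difference(r_first, r_second, alpha=0.05)
d01 = min_significant_difference(r_first, r_second, alpha=0.01)

for label, value in ((".15", d15), (".05", d05), (".01", d01)):
    print(f"level {label}: difference of at least {value:.1f} points")
```

With these assumed reliabilities the required difference rises from somewhat under 3 points at the .15 level to about 5 points at the .01 level, which is of the same order as the figures cited in the text.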
If the Wechsler scales were to be used as a differential aptitude battery
with normal subjects, to compare the individual's relative standing in
different abilities, the subtests should have high reliabilities and very low
intercorrelations. On the other hand, the rationale underlying the proposed
diagnostic interpretations of profile irregularities requires high reliabilities and
high subtest intercorrelations in the normal population. In an abnormal
sample, of course, the subtests should have high reliabilities and lower
intertest correlations because of the hypothesized increase in scatter of scores.
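
Why a differential aptitude battery calls for high reliabilities together with low intercorrelations can be seen from the usual expression for the reliability of a difference between two scores. The sketch below assumes that expression in its equal-variance form; it is offered as an illustration and is not drawn from the sources cited in this chapter. All coefficients are hypothetical.

```python
def difference_reliability(r_aa, r_bb, r_ab):
    """Reliability of the difference between two equally variable scores.

    r_aa, r_bb: reliabilities of the two subtests; r_ab: their intercorrelation.
    """
    return ((r_aa + r_bb) / 2.0 - r_ab) / (1.0 - r_ab)


# Two equally reliable subtests (hypothetical values).
low_overlap = difference_reliability(0.90, 0.90, r_ab=0.20)
high_overlap = difference_reliability(0.90, 0.90, r_ab=0.70)

print(f"intercorrelation .20: difference reliability {low_overlap:.2f}")
print(f"intercorrelation .70: difference reliability {high_overlap:.2f}")
```

With the same subtest reliabilities, raising the intercorrelation from .20 to .70 lowers the reliability of the difference score from about .88 to about .67.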

That the differences regarded as diagnostic by Wechsler do in fact occur
frequently in the normal population has been further noted by Jones (37),
in reference to a statement in the WAIS manual (73, p. 18). There the
number of intertest differences exceeding 3 points and those exceeding 5 points
in the normal population is estimated from the results obtained in the
standardization sample. As Jones points out, however, the eleven subtests in
any one individual's record yield 55 possible intertest comparisons. Hence
a difference expected in, let us say, 10 per cent of the cases in any one
intertest comparison will actually occur about 5 times in a single individual's
record (10 per cent of 55 = 5.5). Rather than occurring in only 10 per
cent of normal persons, therefore, differences of such magnitude would be
found, on the average, 5 times in every normal person's record.
In individual cases, scatter greater than that found in the normative
sample may result, not from pathological conditions, but from differences in
educational, occupational, cultural, or other background factors. Language
handicap may account for lower Verbal than Performance scores. It will be
recalled that skilled laborers tend to score higher on Performance than on
Verbal Scales, unlike white-collar groups. Socioeconomic and urban-rural
differences in subtest profile have likewise been noted. On the other hand,
administration of WISC and WAIS to Jewish subjects from the preschool to the
college level revealed a significant tendency for Verbal to exceed
Performance IQ (42). This difference, which increased with age, was attributed to the
cumulative effect of Jewish cultural values emphasizing verbal abilities. To be
sure, Wechsler calls attention to the need for considering background factors
in scatter analysis (74, p. 160). But in the zeal to apply diagnostic signs and
indices, such precautions are easily forgotten.
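
The arithmetic behind Jones's point about the number of intertest comparisons in a single record can be checked in a few lines. The sketch uses only the figures given in the text (eleven subtests, a difference found in 10 per cent of cases for any one comparison); the variable names are illustrative.

```python
from math import comb

n_subtests = 11
base_rate = 0.10  # proportion of normal cases showing the difference in one comparison

# Number of distinct pairs of subtests in one individual's record.
n_pairs = comb(n_subtests, 2)

# Expected number of such differences per record. Expectations add even when
# the comparisons are correlated, so no independence assumption is needed here.
expected_per_record = n_pairs * base_rate

print(n_pairs, expected_per_record)
```

The result agrees with the 55 comparisons and the figure of about 5 per record cited above.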

The assumption that those subtests showing most normal age decrement 
are most sensitive to pathological deterioration, which underlies the com- 
putation of a deterioration index, is also questionable. The fact that the older 
subjects in the standardization sample had received less education than the 
younger makes the interpretation of the observed decrement even more sus- 
pect. Furthermore, some of the Hold tests may simply have more flexible 
scoring standards and may for this reason be less sensitive to change. Some 
evidence is available to suggest that the apparent resistance of vocabulary 
tests to decline may be attributable to the substitution of poorer but equally 
acceptable definitions on the part of deteriorated subjects (13, 21). 

Apart from theoretical considerations, extensive data gathered by other 
investigators have failed to corroborate the various hypotheses regarding 
diagnostic interpretations of score profiles. That the evidence is predomi- 
nantly negative is apparent from a survey of the published literature (cf. 
30, 55, 69). The original hypotheses were derived either from uncontrolled 
clinical observations or from comparisons of pathological and control groups 
that were not equated in age, education, or other factors. Failure to cross- 
validate likewise accounts for a number of spurious differences. Subsequent 
studies offer little empirical support to the hypotheses regarding scatter and 
score patterns. Similarly, mental defectives could not be differentiated from 
psychotics on the basis of the deterioration index. Schizophrenics obtained
no higher deterioration index than neurotics, and patients with brain damage 
no higher index than those with functional disorders. Even more telling ref- 
utation was provided by longitudinal analyses of retest records, which failed 
to reveal a significant relation between actual decline in test scores and the 
deterioration index (48, 66). Also relevant are the observations that
"normal" deterioration is not found among older institutionalized mental
defectives (9), and that vocabulary scores decline in patients hospitalized for
long periods (78). 

On the other hand, there is some evidence in recent factorial analyses 
that supports the presence of a common “deterioration factor" through some 
of the Wechsler subtests (63). The same research indicated a possible
relation between performance on certain WAIS subtests and indices of brain
functioning derived from electroencephalography. It was also found that
relatively low Digit Span scores occurred more often among brain damaged 
than among normal subjects (60). It is possible that further research will 
justify the utilization of at least some score patterns on the Wechsler scales 
for diagnostic purposes. 

Like any individual intelligence test, the Wechsler scales can theoretically 
provide information at various levels. At the most objective level, these 
scales yield an IQ with high reliability and fair evidence of validity. At a
purely qualitative level, any irregularity of performance should alert the 
clinician to look for peculiarities of past experience, emotional associations, 
and other individual factors. Bizarreness, overelaboration, or excessive self- 
reference in responses are often indicative of personality disorders. Even 
when correct, specific responses may provide promising leads. As Wechsler 
points out (74, p. 181), if in the Vocabulary test one individual defines 
"sentence" as a group of words and another as a penalty imposed by a judge, 
this difference may furnish a clue to important dissimilarities in background 
or personality. 

Such qualitative interpretations, however, are tentative and require fur- 
ther verification. The same response—or score pattern—may have different 
meanings for different subjects. It may have deep significance for some and 
only trivial connotations for others. Because of their idiosyncratic meanings, 
such cues cannot be validated by quantitative methods adapted to group 
trends. Between these two extremes, from the quantitative, objective IQ
to purely qualitative observations, fall the various attempts at
semiquantitative profile analysis. It is at this intermediate level that both theoretical
analyses and empirical results have so far lent little or no support to the
proposed interpretations.


OTHER TESTS OF INTELLECTUAL IMPAIRMENT 


A number of tests have been specially designed as clinical instruments for
assessing intellectual impairment. Several have been described as indices of
"organicity" or brain damage. Most available tests in this category, however,
are broader in function, and are employed to detect intellectual
deterioration or impairment arising from a variety of possible causes. The many
testing techniques proposed for this purpose have been surveyed in a number of
books and articles (35, 39, 52, 57, 67, 69, 75, 76, 77). Several are in an
experimental stage. Some are designed only to afford an opportunity for
qualitative observations. Others have relatively standardized
procedure and objective norms, although the normative samples are often
inadequate. Some are subject to one or more of the theoretical objections
raised against profile analysis of Wechsler scores. A few have yielded
promising empirical data.

All available tests of intellectual impairment are based on the premise 
of a differential deficit in different functions. Chief among the functions con- 
sidered to be most sensitive to pathological processes are memory, spatial 
perception, and abstraction or concept formation. Typical examples of tests 
in these three areas will be illustrated in this section. 

Memory Tests. As early as 1930, Babcock (4) proposed an index of deteri- 
oration based on the principle of "Hold" and “Don’t Hold" tests. In the sub- 
sequently developed Babcock-Levy Test (5), Stanford-Binet vocabulary 
score furnishes the estimate of previous ability, and deterioration is measured 
with tests of memory, simple learning, and motor speed. The Hunt-Minne- 
sota Test (34) likewise employs Stanford-Binet vocabulary as a point of 
reference for evaluating performance on six memory tests. Certain tests 
require the reproduction of designs, thus detecting disturbances in both mem- 
ory and spatial perception. 

An example of the latter type of test is the Benton Visual Retention 
Test (10). In this test, 10 designs of increasing complexity are individually 
presented on cards, the subject being instructed to reproduce each design 
immediately upon its removal. Performance is scored in terms of number of 
correct reproductions and number of errors. Other equivalent series of draw- 
ings are available for administration with shorter exposure, with a fifteen- 
second delay, and as a copying test. Scores are interpreted in terms of the 
subject's IQ as determined by any standard verbal-type intelligence test.
If this is unavailable, educational or vocational data are utilized as a rough 
estimate of the subject’s previous intellectual level. 

Perceptual Tests. Tasks involving the perception of spatial relations have 
been used for a long time as qualitative, informal tests of brain damage. They 
were included, for example, in early test series for the examination of 
aphasics, and are still employed for this purpose. Frequently, such tests re- 
quire the subject to copy simple designs. Some investigators have suggested 
that tests of this type might reveal not only perceptual disorders, but also 
disturbances in the subject's attitude toward the task (cf., e.g., 53). If that 
is the case, then these tests might have a broader applicability in the
detection of a variety of psychological disorders.

An example of a clinical test in this category is the Bender Visual Motor 
Gestalt Test, commonly known as the Bender-Gestalt Test (8). In this test 
the nine simple designs shown in Figure 62 are presented individually on 
cards. The subject is instructed to copy each design, with the sample before
him. The designs were selected by Bender from a longer series originally
employed by Wertheimer, one of the founders of the Gestalt school, in his
studies of visual perception. The particular designs were constructed so as to
illustrate certain principles of Gestalt psychology, and Bender's own analyses
of the test results are formulated in terms of Gestalt concepts. Although for
many years the test was administered by Bender and others to children and
adults showing a variety of psychological disorders, the data were not
reported in objective and systematic form and were therefore difficult to
evaluate.


Fig. 62. The Bender-Gestalt Test. (From 8, p. 4; reproduced by permission of
Lauretta Bender.)


More recently, Pascal and Suttell (53) undertook a standardization and
quantification of the Bender-Gestalt Test on an adult population. On the
basis of the drawing errors that significantly differentiated between matched
samples of normals and abnormals, a relatively objective scoring key was
developed. Cross-validation of this key on new samples of 474
"non-patients" (or normal controls), 187 neurotics, and 136 psychotics yielded the
distributions shown in Figure 63. It can be seen that as a group the psychotics
and neurotics are clearly differentiated from the controls, the mean scores of
the three groups being 81.8, 68.2, and 50, respectively. These scores are
standard scores with a mean of 50 and an SD of 10, the higher scores
indicating more diagnostic errors. The biserial correlation of test scores against
the criterion of patient versus non-patient status was .74. This correlation
may be regarded as a measure of concurrent validity. In a later study with
small groups of children, the test significantly distinguished normals from
schizophrenics and mental defectives (26).

[Figure 63: horizontal axis, Standard Score on Bender-Gestalt Test]

Fig. 63. Percentage Distribution of Psychotic and Neurotic Patients and Control
"Non-Patient" Group on Pascal-Suttell Standardization of Bender-Gestalt Test. (Data
from Pascal and Suttell, 53, p. 30.)
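
Since the Pascal-Suttell scores are standard scores with a mean of 50 and an SD of 10, the separation of the group means reported above can be restated in SD units. The short sketch below does only that conversion, using the three means given in the text; the names are illustrative.

```python
CONTROL_MEAN = 50.0  # mean of the standard-score scale
SCALE_SD = 10.0      # SD of the standard-score scale

group_means = {"psychotics": 81.8, "neurotics": 68.2, "controls": 50.0}

# Distance of each group mean from the control mean, in SD units.
sd_units = {
    group: (mean - CONTROL_MEAN) / SCALE_SD
    for group, mean in group_means.items()
}

for group, distance in sd_units.items():
    print(f"{group}: {distance:.2f} SD above the control mean")
```

On this scale the psychotic group mean lies a little over 3 SD, and the neurotic group mean a little under 2 SD, above the mean of the controls.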


Retest reliabilities of about .70 were found in normal samples over a 
twenty-four-hour interval. Scorer reliabilities of approximately .90 are re- 
ported for trained scorers. Performance on the test is apparently independ- 
ent of drawing ability, but is significantly related to amount of education (53) 
and to mental age (7, 22). In this adaptation, the Bender-Gestalt Test
appears to have promise as a rapid screening device, especially for detecting
the more serious forms of disturbance. The normative sample, however, is
rather restricted geographically, educationally, and in other ways. Extension
and revision of the norms on the basis of a larger and more representative
sample would be desirable. Further checking of validity in different
samples and against other criteria would likewise be of interest.

Another example of a perceptual test is provided by the Grassi Block
Substitution Test (29), which represents a special adaptation and extension of
the Kohs Block Design. While still in a research stage, this test shows
promise, especially as an indicator of brain damage. Mention should also be
made of the investigations that have tried to establish "signs of organicity"
for the Rorschach inkblot test. Cross-validation has generally failed to
substantiate the diagnostic value of the proposed signs. The Rorschach test,
which may be regarded as a highly unstructured sort of perceptual test, will
be discussed in Chapter 20, which deals with projective tests of personality.

Concept Formation Tests. Clinical tests of conceptual thinking are more 
concerned with the methods employed by the subject than with the end re- 
sult achieved. Largely for this reason, such tests lean heavily upon qualita- 
tive observations and have so far made little use of objective scoring and 
standardized norms. It is recognized, of course, that normative data would be 
helpful even in the interpretation of qualitative observations, but such data 
are difficult to obtain. 

A well-known series of concept formation tests was developed by Gold- 
stein and Scheerer (27). This series includes an adaptation and extension of 
the Kohs Block Design, another design test in which the subject copies sim- 
ple geometric patterns with sticks and later reproduces them from memory, 
and several sorting tests. An example is the color-form sorting test, in which
pieces differing in both shape and color are to be sorted twice in different
ways. In this test, interest centers on the subject's ability to shift
spontaneously from one basis of classification to the other.


Fig. 64. Hanfmann-Kasanin Concept Formation Test. (Courtesy C. H. Stoelting
Company.)

A somewhat more standardized object-sorting test is the Hanfmann-Kasanin
Concept Formation Test (32). In this test, the subject is given the
blocks pictured in Figure 64. These 22 blocks vary in color, shape, height,
and surface size. A category name (nonsense syllables such as inus) is
printed on the underside of each block, not visible to the subject in the
initial presentation. The subject's task is to discover how the blocks should be
classified into four groups, each corresponding to one of the four category
names. At the outset, the examiner picks up a sample block, shows the sub- 
ject the name on the underside, and asks him to select all blocks belonging in 
that class. When the subject has made an error in grouping blocks, the error 
is shown to him by reversing the block and exhibiting the class name. With 
the aid of these clues, the subject works until he arrives at the correct solu- 
tion. He is then asked to state the principle of classification and to re-sort the 
blocks. The Hanfmann-Kasanin test is more difficult than the sorting tests 
in the Goldstein-Scheerer series and is not so satisfactory with persons of low 
intellectual level. Scoring is quite complex and requires qualitative judgments. 

Other sorting tests on which research is being conducted include the Kahn 
Test of Symbol Arrangement (38) and the Wisconsin Card-Sorting Test 
(12, 23, 28). The latter is unusual in the objectivity of its administration 
and scoring and in the number of well-controlled experiments performed 
with it. In studies with schizophrenic, brain-injured, and mentally defective 
subjects, it has shown considerable promise as a diagnostic tool. 

It should be noted that many of the tests cited in this chapter show a
close kinship with personality tests, to be discussed in Part 4. Clinical tests
provide a particularly clear illustration of the arbitrary distinction between
personality and ability tests. At least some of the techniques considered in
this chapter could have been included in one of the chapters of Part 4. They
have been treated at this point because, in either their construction or their
application, they resemble ability tests somewhat more closely than they do
personality tests. But the differentiation is minor and is made solely for ease
of discussion.
PART 3


Differential Testing of Abilities


CHAPTER 13


Multiple Aptitude Batteries 


One of the chief distinguishing features of contemporary psychological 
testing is its differential approach to the measurement of ability. To be sure, 
the general intelligence tests discussed in Part 2 are still widely used in pre- 
liminary screening and in the clinical examination of extreme deviates. The 
period since World War II, however, has witnessed a rapid increase in the 
development and application of instruments that permit an analysis of per- 
formance with regard to different aspects of intelligence. Such instruments 
yield, not a single, global measure such as an IQ, but a set of scores in differ- 
ent aptitudes. They thus provide an intellectual profile showing the individ- 
ual’s characteristic strengths and weaknesses. 

A number of events have contributed to the growing interest in differential
aptitude testing. First, there has been an increasing recognition of
intra-individual variation in performance on intelligence tests. Crude attempts to
compare the individual's relative standing on different subtests or item groups
antedated the development of multiple aptitude batteries by many years. As
has been repeatedly pointed out, however, intelligence tests were not
designed for such a purpose. The subtests or item groups are often too
unreliable to justify intra-individual comparisons. In the construction of
intelligence tests, moreover, items or subtests are generally chosen to provide a
unitary and internally consistent measure. In such a selection, an effort is
therefore made to minimize, rather than maximize, intra-individual variation.
Subtests or items that correlate very low with the rest of the scale would, in
general, be excluded. Yet these are the very parts that would probably have
been retained if the emphasis had been on the differentiation of abilities.
Another illustration of the same trend is to be found in the more defensible
practice of reporting two scores on intelligence tests, such as verbal and
numerical, linguistic and quantitative, verbal and non-verbal, etc.


The development of multiple aptitude batteries has been further stimulated
by the gradual realization that so-called general intelligence tests are in
fact less general than was originally supposed. It soon became apparent that
many such tests were primarily measures of verbal comprehension. Certain 
areas, such as that of mechanical abilities, were usually untouched, except 
in some of the performance and non-language scales. As these limitations of 
intelligence tests became evident, psychologists began to qualify the term 
"intelligence." Distinctions between "academic" and "practical" intelligence
were suggested by some. Others spoke of "abstract," "mechanical," and
"social" intelligence. Tests of "special aptitudes" were likewise designed to
supplement the intelligence tests. But closer analysis showed that the intelli- 
gence tests themselves could be said to measure a certain combination of 
special aptitudes, such as verbal and numerical aptitudes, although the 
area covered by such tests was loosely and inconsistently defined. 

A strong impetus to differential aptitude testing was also provided by the 
growing activities of psychologists in vocational counseling, as well as in
the selection and classification of industrial and military personnel. The early
development of specialized tests in clerical, mechanical, and other vocational
areas is a reflection of such interests. The assembling of test batteries for the
selection of applicants for admission to schools of medicine, law, engineering,
dentistry, and other professional fields represents a similar development
which has been in progress for many years. Moreover, a number of
differential aptitude batteries, such as those prepared by the Air Force and by the
United States Employment Service, were the direct result of vocational
selection or classification work.


Finally, the application of factor analysis to the study of trait
organization provided the theoretical basis for the construction of multiple aptitude
batteries. Through such factorial techniques, the different abilities loosely
grouped under "intelligence" could be more systematically identified, sorted,
and defined. Tests could then be selected so that each represented the best
available measure of one of the traits or factors identified by factor analysis.


FACTOR ANALYSIS 


The principal object of factor analysis is to simplify the description of
data by reducing the number of necessary variables or "dimensions." Such
reduction was illustrated in the brief introduction to factor analysis given in
Chapter 6. Thus if we find that five factors are sufficient to account for all
the common variance in a battery of 20 tests, we can for most purposes
substitute 5 scores for the original 20 without sacrificing any essential
information. The usual practice is to retain from among the original tests those
providing the best measures of each of the factors.
All techniques of factor analysis begin with a complete table of
intercorrelations among a set of tests. Such a table is known as a correlation matrix.
Every factor analysis ends with a factor matrix, i.e. a table showing the 
weight or loading of each of the factors in each test. A hypothetical factor 
matrix involving only two factors is shown in Table 19. The factors are listed 
across the top and their weights in each of the 10 tests are given in the appro- 
priate rows. 


TABLE 19. A Hypothetical Factor Matrix


Test                              Factor I    Factor II

 1. Vocabulary                      .74         .54
 2. Analogies                       .64         .39
 3. Sentence Completion             .68         .43
 4. Disarranged Sentences           .32         .23
 5. Reading Comprehension           .70         .50
 6. Addition                        .22        -.51
 7. Multiplication                  .40        -.50
 8. Arithmetic Problems             .52        -.48
 9. Equation Relations              .43        -.37
10. Number Series Completion        .32        -.25


Several different methods for analyzing a set of variables into common
factors have been derived. As early as 1901, Pearson (32) pointed the way
for this type of analysis. T. L. Kelley (29) and Thurstone (42) in America
and Burt (7) in England did much to advance the method. Alternative
procedures, modifications, and refinements have been developed by many
others. The availability of electronic computers is rapidly leading to the
adoption of more refined and laborious techniques. Although differing in
their initial postulates, most of these methods yield similar results.
Currently the most widely used technique, especially in America, is the
centroid method formulated by Thurstone (42). The factor matrix given in
Table 19 is typical of those found by this method. For brief and simple
introductions to factorial procedures, the student is referred to Guilford
(23, Ch. 16) and Adcock (1). A more detailed treatment of the methodology of
factor analysis can be found in Fruchter (19).

It is clearly beyond the scope of this book to cover the mathematical basis
or the computational procedures of factor analysis. But an understanding of
the results of factor analysis need not be limited to those who have mastered
its specialized methodology. Even without knowing how the factor loadings
were computed, the student will be able to see how a factor matrix is
utilized in the identification and interpretation of factors. For an
intelligent reading of reports of factorial research, however, familiarity
with a few other concepts and terms is helpful.

¹ For specific references and for a fuller discussion of the methods,
problems, and results of factor analysis, see Anastasi (3, Chs. 10 and 11).

It is customary to represent factors geometrically as reference axes in 
terms of which each test can be plotted. Figure 65 illustrates this procedure. 


Fig. 65. A Hypothetical Factor Pattern, Showing Weights of Two Group Factors
in Each of Ten Tests.


In this graph, each of the 10 tests from Table 19 has been plotted against
the two factors, which correspond to axes I and II. Thus the point
representing Test 1 is located by moving .74 of the distance along axis I and
.54 of the distance along axis II. The points corresponding to the remaining
9 tests are plotted in the same way, using the weights given in Table 19.
Although all the weights on Factor I are positive, it will be noted that on
Factor II some of the weights are positive and some negative. This can also
be seen in Figure 65, where Tests 1 to 5 cluster in one part of the graph and
Tests 6 to 10 in another.

In this connection it should be noted that the position of the reference
axes is not fixed by the data. The original correlation table determines only
the position of the tests (points in Figure 65) in relation to each other.
The same points can be plotted with the reference axes in any position. For
this reason, factor analysts usually rotate axes until they obtain the most
satisfactory and easily interpretable pattern. This is a legitimate
procedure, somewhat analogous to measuring longitude from, let us say,
Chicago rather than Greenwich.

The reference axes in Figure 65 were rotated to positions I’ and II’, 
shown by the broken lines. This rotation was carried out in accordance with 
Thurstone’s criteria of positive manifold and simple structure. The former 
requires the rotation of axes to such a position as to eliminate all significant 
negative weights. Most psychologists regard negative factor loadings as 
inapplicable to aptitude tests, since such a loading implies that the higher the 
individual rates in the particular factor, the poorer will be his performance 
on the test. The criterion of simple structure means that each test shall have 
loadings on as few factors as possible. Both of these criteria are designed to 
yield factors that can be most readily and unambiguously interpreted. 

It will be seen that on the rotated axes in Figure 65 all the verbal tests
(Tests 1 to 5) fall along or very close to axis I’. Similarly, the numerical
tests (Tests 6 to 10) cluster closely around axis II’. Parenthetically, the
reader may feel that rotated axis II’ should have been labeled -II’, to
correspond to the unrotated axis -II. Which pole of the axis is labeled plus
and which minus, however, is an arbitrary matter. In the present example, the
rotated axis II’ has been “reflected” in order to eliminate negative weights.

The new factor loadings, measured along the rotated axes, are given in
Table 20. The reader may easily verify these factor loadings by preparing a
paper “ruler” with a scale of units corresponding to that in Figure 65. With
this “ruler,” distances can be measured along the rotated axes. The factor
loadings in Table 20 include no negative values except for very low,
negligible amounts attributable to sampling errors. All of the verbal tests
have high loadings on Factor I’ and practically zero loadings on Factor II’.
The numerical tests, on the other hand, have high loadings on Factor II’ and
low, negligible loadings on Factor I’. The identification and naming of the
two factors and the description of the factorial composition of each test
have thus been simplified by the rotation of reference axes. In actual
practice, the number of factors is often greater than two, a condition that
complicates the geometrical representation as well as the statistical
analysis, but does not alter the basic procedure. The fact that factor
patterns are rarely as clear-cut as the one illustrated in Figure 65 adds
further to the difficulty of rotating axes and of identifying factors.


TABLE 20. Rotated Factor Matrix
(Data from Figure 65)


Test                              Factor I’    Factor II’

 1. Vocabulary                       .91         -.06
 2. Analogies                        .75          .02
 3. Sentence Completion              .80          .00
 4. Disarranged Sentences            .39         -.02
 5. Reading Comprehension            .86         -.04
 6. Addition                        -.09          .55
 7. Multiplication                   .07          .64
 8. Arithmetic Problems              .18          .68
 9. Equation Relations               .16          .54
10. Number Series Completion         .13          .38
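
The check with a paper ruler can also be made arithmetically. The Python
sketch below projects the Table 19 loadings onto a pair of axes turned
through a fixed angle and reflects the second axis, as described for axis
II’. It is offered only as an illustration under stated assumptions: the
angle of 32.5 degrees is not given in the text and was chosen here because it
reproduces the loadings quoted for Tests 1, 6, and 8; the Addition row is
read as .22 and -.51; and the names TABLE_19 and rotate_and_reflect are
hypothetical.

```python
import math

# Unrotated loadings (Factor I, Factor II) as read from Table 19.
# The Addition row (.22, -.51) is an assumed reading of a hard-to-read entry.
TABLE_19 = {
    "Vocabulary": (0.74, 0.54),
    "Analogies": (0.64, 0.39),
    "Sentence Completion": (0.68, 0.43),
    "Disarranged Sentences": (0.32, 0.23),
    "Reading Comprehension": (0.70, 0.50),
    "Addition": (0.22, -0.51),
    "Multiplication": (0.40, -0.50),
    "Arithmetic Problems": (0.52, -0.48),
    "Equation Relations": (0.43, -0.37),
    "Number Series Completion": (0.32, -0.25),
}


def rotate_and_reflect(loadings, degrees):
    """Re-express each test on orthogonal axes turned by `degrees`.

    Axis I' is axis I turned toward the positive end of axis II.
    Axis II' is the perpendicular axis with its sign reversed ("reflected").
    """
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    rotated = {}
    for test, (a, b) in loadings.items():
        new_i = a * c + b * s   # coordinate along rotated axis I'
        new_ii = a * s - b * c  # coordinate along reflected axis II'
        rotated[test] = (round(new_i, 2), round(new_ii, 2))
    return rotated


# 32.5 degrees is an assumption made for this illustration only.
ROTATED = rotate_and_reflect(TABLE_19, 32.5)

for test, (f1, f2) in ROTATED.items():
    print(f"{test:26s} {f1:6.2f} {f2:6.2f}")
```

Because loadings read from a graph are approximate, a computation of this
kind can be expected to agree with a printed table only to within about .01.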

Once the rotated factor matrix has been computed, we can proceed with 
the interpretation and naming of factors. This step calls for psychological in- 
sight rather than statistical training. To learn the nature of a particular factor, 
we simply examine the tests having high loadings on that factor and we try 
to discover what psychological processes they have in common. The more 
tests there are with high loadings on a given factor, the more clearly we 
can define the nature of the factor. In Table 20, for example, it is apparent 
that Factor I’ is verbal and Factor II’ is numerical. 

Factor loadings also represent the correlation of each test with the factor.
It will be recalled that this correlation is the factorial validity of the
test (Ch. 6). From Table 20 we can say, for instance, that the factorial
validity of the Vocabulary test as a measure of the verbal factor is .91. The
factorial validity of the Addition test, in terms of the numerical factor, is
.55. Obviously the first five tests have negligible validity as measures of
the numerical factor, and the last five have practically no validity as
measures of the verbal factor. The concept of factorial validity is
especially relevant to the type of tests to be discussed in this chapter.

Another type of information that can be obtained from a factor matrix is
the proportional contribution of each factor to the total variance of a test.
This contribution is simply the square of the factor loading. For example, if
we examine the loadings of Test 8, Arithmetic Problems, in Table 20, we can
see that 3 per cent of its variance (.18² = .03) is attributable to the
verbal factor and 46 per cent (.68² = .46) to the numerical factor. The total
variance of this test attributable to common factors is thus 49 per cent
(.46 + .03 = .49). This is known as its communality. Let us now suppose that
the reliability of this test had been found to be .85. We can then conclude
that 15 per cent of the test variance is error variance. So far we have
accounted for 64 per cent of the test variance (.49 + .15 = .64). The
remaining 36 per cent is the specificity of the test, covering any specific
factors occurring only in this test. The variance of any test can be broken
down in a similar fashion, if we know its common factor loadings and its
reliability coefficient.
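
The arithmetic of this breakdown is easily mechanized. The following Python
sketch repeats the computation for Arithmetic Problems, using the loadings
(.18 and .68) and the supposed reliability (.85) given above. The function
name variance_breakdown and the rounding to two places at each step are
choices made for this illustration, not part of the original procedure.

```python
def variance_breakdown(loadings, reliability):
    """Split a test's total variance (taken as 1.00) into proportions.

    loadings    -- loadings of the test on each common (orthogonal) factor
    reliability -- reliability coefficient of the test
    """
    # Contribution of each factor is the square of its loading.
    by_factor = [round(w * w, 2) for w in loadings]
    # Communality: variance attributable to all common factors together.
    communality = round(sum(by_factor), 2)
    # Error variance: the part of the variance that is not reliable.
    error = round(1.0 - reliability, 2)
    # Specificity: what remains after common and error variance.
    specificity = round(1.0 - communality - error, 2)
    return {
        "by_factor": by_factor,
        "communality": communality,
        "error": error,
        "specificity": specificity,
    }


ARITHMETIC_PROBLEMS = variance_breakdown([0.18, 0.68], reliability=0.85)
print(ARITHMETIC_PROBLEMS)
```

The same function can be applied to any other row of a rotated factor matrix,
provided a reliability coefficient for the test is available.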


The axes employed in Figure 65 are known as orthogonal axes, since they
are at right angles to each other. Occasionally, the test clusters are so situated 
that a better fit can be obtained with oblique axes. In such a case, the 
factors would themselves be correlated. Some investigators have maintained 
that orthogonal, or uncorrelated, factors should always be employed, since 
they provide a simpler and clearer picture of trait relationships. Others insist 
that oblique axes should be used when they fit the data better, since the 
most meaningful categories need not be uncorrelated. An example cited by 
Thurstone is that of height and weight. Although it is well known that 
height and weight are highly correlated, they have proved to be useful cate- 
gories in the measurement of physique. 
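
In the geometric representation used here, the correlation between two
factors may be read as the cosine of the angle between their axes. This is a
standard property of the representation rather than a statement made in the
text, and the short Python sketch below (with the hypothetical function name
factor_correlation) is included only to show why right-angled axes imply
uncorrelated factors while oblique axes imply correlated ones.

```python
import math


def factor_correlation(angle_degrees):
    """Correlation between two factors whose axes meet at the given angle."""
    return math.cos(math.radians(angle_degrees))


ORTHOGONAL = factor_correlation(90)  # axes at right angles
OBLIQUE = factor_correlation(60)     # an arbitrarily chosen oblique angle

print(round(ORTHOGONAL, 2), round(OBLIQUE, 2))
```

The angle of 60 degrees is arbitrary; any angle other than 90 degrees yields
a correlation different from zero.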

When the factors are themselves correlated, it is possible to subject the
intercorrelations among the factors to the same statistical analyses we
employ with intercorrelations among tests. In other words, we can “factorize
the factors” and derive second-order factors. This process has been followed
in a number of studies with both aptitude and personality variables. Certain
investigations with aptitude tests have yielded a single second-order general
factor. As a rule, American factor analysts have proceeded by accounting
for as much of the common variance as possible through group factors and
then identifying a general factor as a second-order factor if the data
justified it. British psychologists, on the other hand, usually begin with a
general factor, to which they attribute the major portion of the common
variance, and then resort to group factors to account for any remaining
correlation. These procedural differences reflect differences in theoretical
orientation to be discussed in the following section.


THEORIES OF TRAIT ORGANIZATION 


The Two-Factor Theory. The first theory of trait organization based upon a
statistical analysis of test scores was the Two-Factor theory developed by
the British psychologist, Charles Spearman (34, 35). In its original
formulation, this theory maintained that all intellectual activities share a
single common factor, called the “general factor,” or g. In addition, the
theory postulated numerous specific or s factors, each being strictly
specific to a single activity. Positive correlation between any two functions
was thus attributed to the g factor. The more highly the two functions were
“saturated” with g, the higher would be the correlation between them. The
presence of specifics, on the other hand, tended to lower the correlation
between functions.
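
The relation between saturation and correlation can be put in numerical form.
If g is the only factor two tests have in common, the correlation expected
between them is the product of their g loadings. This product rule is the
usual algebraic statement of the theory and is not a formula given in the
text; the loadings in the Python sketch below are invented for illustration,
and the function name expected_correlation is hypothetical.

```python
def expected_correlation(g_loading_a, g_loading_b):
    """Correlation implied when g is the only factor two tests share."""
    return g_loading_a * g_loading_b


# Two tests highly saturated with g (hypothetical loadings).
HIGH = expected_correlation(0.9, 0.8)
# Two tests weakly saturated with g, i.e., dominated by their specifics.
LOW = expected_correlation(0.5, 0.4)

print(round(HIGH, 2), round(LOW, 2))
```

The larger the part played by specifics in each test, the smaller its g
loading and hence the lower the expected correlation.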

It follows from the Two-Factor theory that the aim of psychological
testing should be to measure the amount of each individual's g. If this factor
runs through all abilities, it furnishes the only basis for prediction of the
subject's performance from one situation to another. It would be futile to 
measure specific factors since each by definition operates in only a single 
activity. Accordingly, Spearman proposed that a single test, highly satu- 
rated with g, be substituted for the heterogeneous collection of items found in 
intelligence tests. He suggested that tests dealing with abstract relations are 
probably the best measures of g and could be used for this purpose. Examples 
of tests constructed as measures of g include Raven's Progressive Matrices 
and the IPAT Culture Free Intelligence Test, both discussed in Chapter 10. 

From the outset, Spearman realized that the Two-Factor theory must be 
qualified. When the activities compared are very similar, a certain degree of 
correlation may result over and above that attributable to the g factor. Thus 
in addition to general and specific factors, there might be another, intermediate
class of factors, not so universal as g nor so strictly specific as the s factors.
Such a factor, common to a group of activities but not to all, has been desig- 
nated as a group factor. In the early formulation of his theory Spearman ad- 
mitted the possibility of very narrow and negligibly small group factors. 
Following later investigations by several of his students, he included much 
broader group factors such as arithmetic, mechanical, and linguistic abilities. 

Multiple-Factor Theories. The prevalent contemporary American view of
trait organization recognizes a number of moderately broad group factors,
each of which may enter with different weights into different tests. For
example, a verbal factor may have a large weight in a vocabulary test, a
smaller weight in an analogies test, and a very small weight in an arithmetic
reasoning test. The publication in 1928 of Kelley's Crossroads in the Mind
of Man (28) paved the way for a large number of studies in quest of
particular group factors. Chief among the factors proposed by Kelley were
manipulation of spatial relationships, facility with numbers, facility with
verbal material, memory, and speed. This list has been modified and extended
by subsequent investigators employing the more modern methods of factor
analysis discussed in the preceding section.

One of the leading exponents of Multiple-Factor theory was Thurstone.
On the basis of extensive research by himself and his students, Thurstone
proposed about a dozen group factors which he designated as “primary
mental abilities.” Those most frequently corroborated in the work of
Thurstone and of other independent investigators (16, 24, 39, 43) include
the following:


V. Verbal Comprehension: The principal factor in such tests as reading
comprehension, verbal analogies, disarranged sentences, verbal reasoning, and
proverb matching. It is most adequately measured by vocabulary tests.

W. Word Fluency: Found in such tests as anagrams, rhyming, or naming words
in a given category (e.g., boys’ names or words beginning with the same
letter).

N. Number: Most closely identified with speed and accuracy of simple
arithmetic computation.

S. Space: It is possible that this factor may represent two distinct factors,
one covering perception of fixed spatial or geometric relations, the other
manipulatory visualization, in which changed positions or transformations
must be visualized. There is also evidence of a third factor of “kinaesthetic
imagery” (31).

M. Associative Memory: Found principally in tests demanding rote memory for
paired associates. There is some evidence to suggest that this factor may
reflect the extent to which memory crutches are utilized (12). The evidence
is against the presence of a broader factor through all memory tests. Other
restricted memory factors, such as memory for temporal sequences and for
spatial position, have been suggested by some investigations.

P. Perceptual Speed: Quick and accurate grasping of visual details,
similarities, and differences. This factor may be the same as the speed
factor identified by earlier investigators. This is one of several factors
subsequently identified in perceptual tasks (41).

I (or R). Induction (or General Reasoning): The identification of this factor
was least clear. Thurstone originally proposed an inductive and a deductive
factor. The latter was best measured by tests of syllogistic reasoning and
the former by tests requiring the subject to find a rule, as in a number
series completion test. Evidence for the deductive factor, however, was much
weaker than for the inductive. Moreover, other investigators suggest a
general reasoning factor, best measured by arithmetic reasoning tests (30).


It should be noted that the distinction between general, group, and
specific factors is not so basic as may at first appear. If the number or
variety of tests in a battery is small, a single general factor may account
for all the correlations among them. But when the same tests are included in
a larger battery with a more heterogeneous collection of tests, the original
general factor may emerge as a group factor, common to some but not all
tests. Similarly, a certain factor may be represented by only one test in the
original battery, but may be shared by several tests in the larger battery.
Such a factor would have been identified as a specific in the original
battery, but would become a group factor in the more comprehensive battery.
In the light of these considerations, it is not surprising to find that
intensive factorial investigations of special areas have yielded many factors
in place of the one or two primary mental abilities originally identified in
each area. Such has been the case in studies of verbal (11), perceptual (41),
memory (12), and reasoning (21) tests.

Factorial research seems to have produced a bewildering multiplication of 
factors. On the basis of his own and other published studies, Guilford (24, 
25) listed some fifty factors and pointed to gaps in his schema where other 
factors may eventually be identified. He proposed a threefold model for 
the “structure of intellect.” In this model, intellectual activities are classified 
with regard to operations (cognition, memory, divergent thinking, con- 
vergent thinking, evaluation), products (units, classes, relations, systems, 
transformations, implications), and contents (figural, symbolic, semantic, be- 
havioral). Such a classification yields 120 cells, each corresponding to a po- 
tential factor. A certain amount of order has been achieved by cross-identi- 
fication of factors reported by different investigators and often given different 
names (2, 16). Such cross-identification can be accomplished when there 
are several tests common to the investigations being compared. To facilitate 
this process, a group of factor analysts assembled a kit of "reference tests" 
measuring the principal aptitude factors so far identified. This kit, which is 
distributed by Educational Testing Service (17, 18), makes it easier for 
different investigators planning factorial research to include some common 
tests in their batteries. 
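
The count of 120 cells in Guilford's model follows directly from crossing the
three classifications listed above. The Python sketch below simply enumerates
the combinations; the category names are those given in the text, while the
variable names are chosen for this illustration.

```python
from itertools import product

# The three classifications named in Guilford's "structure of intellect."
OPERATIONS = ["cognition", "memory", "divergent thinking",
              "convergent thinking", "evaluation"]
PRODUCTS = ["units", "classes", "relations", "systems",
            "transformations", "implications"]
CONTENTS = ["figural", "symbolic", "semantic", "behavioral"]

# Each cell is one (operation, product, content) combination,
# i.e., one potential factor in the model.
CELLS = list(product(OPERATIONS, PRODUCTS, CONTENTS))

print(len(CELLS))  # 5 operations x 6 products x 4 contents
```

Each element of CELLS, such as memory for semantic relations, corresponds to
one potential factor, whether or not it has yet been identified empirically.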

It is apparent that even after these efforts at simplification and coordina- 
tion, the number of factors remains large. Human behavior is varied and 
complex, and perhaps it is unrealistic to expect a dozen or so factors to 
provide an adequate description of it. For specific purposes, of course, we 
can choose appropriate factors with regard to both nature and breadth. For 
example, if we are selecting applicants for a difficult and highly specialized 
mechanical job, we would probably want to measure fairly narrow perceptual 
and spatial factors that closely match the job requirements. In selecting 
college students, on the other hand, a few broad factors such as verbal com- 
prehension, numerical facility, and general reasoning would be most rele- 
vant. Illustrations of the different ways in which factorial results have been 
utilized in test development will be given later in this chapter. 

Hierarchical Theories. An alternative schema for the organization of factors
has been proposed by a number of British psychologists, including Burt (9)
and Vernon (49). A diagram illustrating Vernon's application of this system
is reproduced in Figure 66. At the top of the hierarchy, Vernon places
Spearman's g factor. At the next level are two broad group factors,
corresponding to verbal-educational (v:ed) and to practical-mechanical (k:m)
aptitudes, respectively. These major factors may be further subdivided. The
verbal-educational factor, for example, yields verbal and numerical
subfactors. Similarly, the practical-mechanical factor splits into
mechanical-information, spatial, and manual subfactors. Still narrower
subfactors can be identified by further analysis, let us say, of the verbal
tasks. At the lowest level of the hierarchy are the specific factors. Such a
hierarchical structure thus resembles a genealogical tree, with g at the top,
s factors at the bottom, and progressively narrower group factors in between.


Fig. 66. Diagram Illustrating Hierarchical Theory of Human Abilities. (From
Vernon, 49, p. 22. Copyright, 1960, Methuen & Co., Ltd.)

Factors as Operational Unities. That different investigators may arrive at
dissimilar schemas of trait organization becomes less perplexing when we
recognize that the traits identified through factor analysis are simply an
expression of correlation among behavior measures. They are not underlying
entities or causal factors, but descriptive categories. Hence it is
conceivable that different principles of classification may be applicable to
the same data.

The concept of factors as descriptive categories is explicit in the writings
of Thomson (37, 38), Burt (7, 8) and Vernon (49) in England, and those
of Tryon (46) in America. All of these writers have called attention to the
vast multiplicity of behavioral elements, which may become organized into
clusters through either hereditary or experiential linkages. For example, a
broad verbal-educational factor is likely to develop through all activities
learned in school. A narrower factor of numerical aptitude may result from the
fact that all arithmetic processes are taught together by the same teacher in
the same classroom. Hence the child who is discouraged, antagonized, or
bored during the arithmetic period will tend to fall behind in his learning of
all these processes; the one who is stimulated and gratified in his arithmetic
class will tend to learn well all that is taught in that class period and to
develop attitudes that will advance his subsequent numerical learning.

It might be added that other factor analysts have from time to time
expressed essential agreement with these interpretations of factors.
Thurstone (40), for instance, suggested that factors are not to be regarded
as ultimate psychological entities but rather as “functional unities” or
aggregates of more elementary components. Nevertheless, Thurstone's
discussion of factors in other publications and his continued use of the term
“primary mental abilities” have tended to foster the impression of factors as
underlying entities.


MULTIPLE APTITUDE BATTERIES FOR GENERAL USE 


The most direct effect of factor analysis upon test construction can be 
seen in the development of multiple aptitude batteries. Among the best- 
known examples of such batteries designed for general use are the SRA Pri- 
mary Mental Abilities (PMA), Differential Aptitude Tests (DAT), Flana- 
gan Aptitude Classification Tests (FACT), Guilford-Zimmerman Aptitude 
Survey, Holzinger-Crowder Uni-Factor Tests, and Multiple Aptitude Tests 
(MAT). All of these batteries have been thoroughly reviewed in the fourth 
and fifth volumes of the Mental Measurements Yearbooks, as well as in a 
special survey edited by Super (36). The reader is urged to consult these 
sources for several independent evaluations of each instrument. This dis- 
cussion will touch upon a few highlights only. 

The publication of the original Chicago PMA tests in 1941 (44) was a
direct outcome of Thurstone's pioneer research on the identification of the
primary mental abilities discussed in the preceding section (39). For each
factor, those tests with the highest factorial validities were assembled into
a battery requiring six testing sessions. By reducing the number of tests for
each factor, this series was later shortened to a single-booklet, two-hour
edition. Subsequent development included further condensation, yielding
the SRA Primary Mental Abilities for Ages 11-17, as well as downward
extensions to include similar short batteries for ages 7-11 and 5-7 (45).
Illustrative items from the 11-17 battery are reproduced in Figure 67.

Not all the primary mental abilities listed in the preceding section are
represented in all these batteries. The factor scores provided in each battery
are summarized below:


Battery                                   Factor scores

Chicago PMA (Ages 11-17):
  separate-booklet ed. and
  single-booklet ed.                      M   V   W   N   S   R
SRA: PMA Ages 11-17                           V   W   N   S   R
SRA: PMA Ages 7-11                            V       N   S   R   P
SRA: PMA Ages 5-7                             V      (Q)  S  (Q)  P   Mo


In the 5-7 year battery, a rudimentary Quantitative (Q) factor replaces the
Numerical (N) and Reasoning (R) factors found at later ages. A Motor
(Mo) test, involving speed and accuracy in the use of a pencil, was also
added at the 5-7 year level. With regard to content, the PMA tests at this
level are very similar to most group intelligence tests for the primary level
(cf. Ch. 9). Subsequent research has shown, in fact, that multiple aptitude
batteries contribute little at these ages, since the abilities of young
children are very highly intercorrelated (3, pp. 357-360; 10; 20). In fact,
it is not until the high school level that differentiation of abilities has
advanced far enough to justify the practical use of multiple aptitude
batteries.


VERBAL-MEANING is measured by a multiple-choice vocabulary test made up of
the following type of item:

ANCIENT: A. Dry B. Long C. Happy D. Old

The word which means the same as the first word is to be marked.


SPACE is measured by items similar to the one below:

[sample figure followed by answer figures A to E]

Every figure in the row that is the same as the first figure, even though it
is rotated, is to be marked. Figures that are mirror-images of the first
figure are not to be marked.


REASONING is measured by items like the following:

a b x c d x e f x g h x

The letters in the row form a series based on a rule. The problem is to mark
the letter that should come next in the series.


NUMBER is measured by a series of addition items, such as the following:

[sample columns of figures with sums]

The sum of each column of figures is given, but some of the solutions are
right and others wrong. The answer given is to be marked right or wrong.


WORD-FLUENCY is measured by a test requiring the writing of as many words as
possible beginning with a certain letter.


Fig. 67. Description and Illustration of Items from the SRA Tests of Primary
Mental Abilities for Ages 11 to 17. (Reproduced by permission of Science
Research Associates.)


The current PMA batteries have been widely criticized because of techni- 
cal faults. The early forms were based on extensive research and repre- 
sented an important breakthrough in test construction. Rather than providing 
the needed refinement and empirical validation, however, the subsequent 
evolution of these tests has proceeded chiefly in the direction of abridgment
and simplification. Inadequacies of normative data, questionable types of
scores (such as ratio IQ’s), unsupported interpretations of scores, meager 
validity data, improper procedures for computing reliability of speeded tests, 
excessive dependence of scores on speed, and low reliabilities of factor scores 
are among the chief weaknesses of these tests. In their present form, they are 
of interest primarily to illustrate the nature of the factors identified in the 
original research. 

The DAT (5) was developed principally for use in educational and voca- 
tional counseling of high school students. Designed for grades 8 through 12, 
it is also suitable for unselected adults. Although not utilizing factor analysis 
in its construction, the authors of the DAT were guided in their choice of 
tests by the accumulated results of factorial research, as well as by practical 
counseling needs. Rather than striving for factorial purity, they included tests 
that might be factorially complex if they represented well-recognized voca- 
tional or educational areas. The DAT yields the following eight scores: 
Verbal Reasoning, Numerical Ability, Abstract Reasoning, Space Relations, 
Mechanical Reasoning, Clerical Speed and Accuracy, Language Usage I:
Spelling, and Language Usage II: Sentences. A sample item from each test
is reproduced in Figures 68A and 68B.

The DAT manual provides an unusually clear and thorough account of the 
test-construction procedures followed in developing the battery. Norms were 
derived from a large, representative national sample. With the exception of 
Clerical Speed and Accuracy, all tests are essentially power tests. Reliability 
coefficients are high and permit interpretation of intertest differences with 
considerable confidence. By combining information on test reliabilities and 
intercorrelations found in the standardization sample, the test authors deter- 
mined the proportion of differences in excess of chance between each pair of 
tests. This can be done by reference to the chart reproduced in Figure 69. 
For boys, the percentages of such differences ranged from 29 to 52; for girls,
they ranged from 20 to 48.

The amount of validity data available on the DAT is overwhelming, in^ 
cluding several thousand validity coefficients. Most of these data are con- 
cerned with predictive validity in terms of high school and college achieve- 
ment. Many of the coefficients are high, even v 


vith intervals as long as three 
years between test and criterion data. The results are somewhat less €n^ 


couraging with regard to differential prediction. Although, in general, €— 
tests correlate more highly with English courses and numerical tests wit 
mathematics courses, there is evidence of a large general factor underlying 
performance in all academic work. Verbal Reasoning, 


ives 
for example, give 
high correlations with most courses. 
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VERBAL REASONING. 

Choose the two words which will correctly fill the blanks in 
the sentence. For the first blank, select a numbered word 
(top row); for the second blank, select a lettered word 


(bottom row). 


2. queen 
B. fire 


NUMERICAL ABILITY. 
Choose the correct answer for each problem. 


Add 13 Subtract 30 
12 20 


D 59 

E none of these 
The correct answer for the first problem is B; 
for the second, E. 


ABSTRACT REASONING. 
The four “problem figures” in each row make a series. Find 


the one among the “answer figures” which would be the 
next in the series. 
PROBLEM FIGURES ANSWER FIGURES 


A B c D E 


The correct answer is B. 


SPACE RELATIONS. 
Which of the following figures could be made by folding 
the given pattern? Note the two grey sides. The pattern 


always shows the outside of the figure. 


from the Differential Aptitude Tests. (Reproduced by per- 


Fig. 68A. Sample Items y 
ration.) 


mission of The Psychological Corpo 
MECHANICAL REASONING.
Which man in this picture has the heavier load?

CLERICAL SPEED AND ACCURACY.
In each test item, one of the five combinations is underlined. Find the same
combination on the answer sheet and mark it.

LANGUAGE USAGE: I. SPELLING.
Indicate whether each word is spelled right or wrong.
W. man
X. gurl

LANGUAGE USAGE: II. SENTENCES.
Decide which of the lettered parts of each sentence contain errors, if any, and
mark the corresponding letters on the answer sheet.
Ain't we / going to the / office / next week / at all.
    A          B           C          D          E

Fig. 68B. Sample Items from the Differential Aptitude Tests. (Reproduced by
permission of The Psychological Corporation.)

Scores on the DAT are reported on normalized percentile charts. Thus the
individual's relative standing on different tests is not distorted by the
inequalities of percentile units. Corresponding standard scores with a mean of 50
and an SD of 10 can also be read from the profile charts (cf. Fig. 16, Ch. 4). In
order to facilitate the interpretation of intra-individual differences in test
scores, the profile charts were designed so that a distance of 1 inch corresponds
to 10 standard score points.* By computing the standard errors of intra-individual
differences between each pair of tests (cf. Ch. 5), it can be shown that a
difference of approximately 10 standard score points between any two tests is
significant at the .05 level or better. On this basis it is recommended that if
the vertical distance between any two tests on a profile is 1 inch or more, it may
be assumed that a true difference exists. Differences between 1/2 inch and 1 inch
are considered doubtful; and those less than 1/2 inch, negligible.

* This is true in the actual profile charts, not in the reduced reproduction given
in Figure 16.
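
The 10-point rule can be checked against the usual formula for the standard error
of the difference between two scores expressed on the same scale, SD times the
square root of (2 - r1 - r2), where r1 and r2 are the reliabilities of the two
tests. The following is only a sketch: the reliabilities of .90 are assumed values
chosen for illustration, not figures taken from the DAT manual, and the function
name is hypothetical.

```python
import math

def se_difference(sd, r1, r2):
    """Standard error of the difference between two scores on the same scale.

    sd: standard deviation of the common standard-score scale
    r1, r2: reliability coefficients of the two tests
    """
    return sd * math.sqrt(2.0 - r1 - r2)

# Assumed reliabilities of .90 for both tests (illustrative, not from the manual).
se = se_difference(sd=10.0, r1=0.90, r2=0.90)
critical = 1.96 * se  # smallest difference significant at the .05 level

print(round(se, 2), round(critical, 2))
```

Under these assumed reliabilities a difference of about 9 standard score points
reaches the .05 level, so that a 10-point (1-inch) rule would be slightly
conservative. With lower reliabilities the required difference increases.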


Fig. 69. Chart for Determining the Proportion of Differences in Excess of Chance
between Pairs of Tests. Horizontal axis: Mean Reliability Coefficient (.70 to
.98); vertical axis: Proportion of Differences Which Are Significant. (From
Bennett and Doppelt, 4, p. 322; reproduced by permission of Educational and
Psychological Measurement.)


No regression equations or other statistical procedures for predicting specific
criteria with DAT scores are provided, except for an index of over-all scholastic
ability. In a supplement to the Manual, published in 1958, data are presented to
show that the sum of the Verbal Reasoning and Numerical Ability scores yields
coefficients of .70 to .86 with subsequent academic criteria. The combination of
these two scores thus functions as an especially valid "intelligence" test in this
context. Test users may, of course, work out other score combinations and
regression equations in terms of other specific criteria. The test authors
themselves recommend a clinical rather than a statistical interpretation of scores
(cf. Ch. 7), once the profiles have been plotted. In a casebook entitled
Counseling from Profiles (6), they report 30 cases actually submitted by high
school counselors, together with the DAT profiles, to illustrate the ways in which
such profiles can be used in making individual recommendations.

A still different approach to the construction of multiple aptitude batteries is
illustrated by the Flanagan Aptitude Classification Tests, or FACT (15). Beginning
as an outgrowth of Flanagan's research on the development of Air Force
classification tests during World War II, this battery is oriented principally
toward vocational counseling and employee selection. Job analyses of many
occupations led to the identification of 21 "critical job elements," or abilities
differentiating successful from unsuccessful workers on each job. Any one job
element is common to many types of jobs. Three examples of such job elements are
given below (from p. 10, FACT Technical Report, 1959):


Assembly: Ability to visualize the appearance of an object assembled from a 
number of separate parts (see Fig. 70). 


Planning: Ability to plan, organize, and schedule; ability to foresee problems that 
may arise, and to anticipate the best order for carrying out the various steps. 


Ingenuity: Creative or inventive skill; ability to devise ingenious procedures,
equipment, or presentations (see Fig. 70).


Assembly: In each figure, parts are to be assembled so that places having the same
letter are put together. Which of the five assemblies below shows the parts put
together correctly?

Ingenuity: Think of a word or words that will complete a clever and ingenious
solution to the stated problem. Then choose the alternative having the same first
and last letters and the same number of letter spaces as your answer.

S1. A hostess for a children's party wanted to serve ice cream in an interesting
manner, and she decided to make a clown for each child. She placed a ball of ice
cream to represent the clown's head on a round cookie which served for a collar,
and on top of this she inverted a ____.

Answers: Assembly, C; Ingenuity, D (cone)

Fig. 70. Sample Items from Flanagan Aptitude Classification Tests. (Reproduced by
permission of John C. Flanagan.)

A battery of paper-and-pencil tests was prepared to test 19 of the 21 job
elements. The remaining two, Carving and Tapping, require performance tests. Norms
were established on a national sample of approximately 11,000 students in grades 9
to 12. On the basis of the initial, qualitative job analyses, test scores were
combined into 38 Occupational Aptitude Scores, including one score for predicting
general college aptitude and scores for specific occupations ranging from
humanities teacher to plumber. The individual tests have rather low reliabilities
and some of the score distributions suggest inadequate differentiations among
individuals. Reliabilities of the composite Occupational Aptitude Scores are
higher. These composite scores, however, show much more overlap than the
individual test scores. Intercorrelations among the tests indicate that fairly
distinct aptitudes are measured; but apparently many occupations require similar
combinations of aptitudes. A correlation of .90 between the occupational scores
for telephone operator and office clerk is not surprising; but a correlation of
.95 between the aptitude scores of airplane pilot and draftsman suggests
inadequate differential validity!

The Occupational Aptitude Scores are being validated longitudinally, by following
up the subsequent educational and vocational careers of high school students
tested in the standardization sample. This represents an ambitious and commendable
plan. Available data from a one-year follow-up of the present form of the battery
and follow-ups of earlier forms of as much as five years' duration have so far
provided moderately promising evidence of validity. The battery has shown good
predictive validity against professional training criteria. But validity data in
terms of job entry and advancement are meager and sometimes difficult to evaluate
because of criterion inadequacies and the influence of fortuitous factors in
occupational careers. The continuing research program carried on with the FACT
battery will eventually permit a more definitive evaluation of its contribution.

Another battery resulting from wartime Air Force research is the Guilford- 
Zimmerman Aptitude Survey (26). This battery, which is still in the process 
of development and standardization, provides tests for seven factors: (1) 
Verbal Comprehension, (2) General Reasoning, (3) Numerical Operations, 
(4) Perceptual Speed, (5) Spatial Orientation, (6) Spatial Visualization, and 
(7) Mechanical Knowledge. It is suggested that Tests 1 and 2 may be com- 
bined to yield an estimate of “abstract intelligence,” Tests 3 and 4 an estimate 
of “clerical aptitude,” and Tests 5, 6, and 7 an estimate of “mechanical apti- 
tude.” For counseling or personnel classification, however, the use of the 
entire battery is recommended. 

A distinctive feature of the Guilford-Zimmerman battery is its inclusion of two
separate spatial scores. Spatial Orientation refers to the perception of spatial
relations of objects with reference to the observer's own position. This ability
proved important in learning to pilot a plane and may be involved in many jobs
requiring the operation of machines. Spatial Visualization is the ability to
manipulate or to transform an object into . . . battery include concurrent and
predictive correlations with college grades and industrial criteria. The Verbal
Comprehension and . . . significant and moderately high correlations with grades
in most courses. The Numerical and the two Spatial tests show some significant
correlations with grades in certain courses and with industrial criteria, but
evidence for differential validity is very meager.

Part V. Spatial Orientation
This is the prow (front end) of a motor boat in which you are riding. These are
tiny pictures of the boat's prow. This is the aiming point. It is the exact spot
you would see on land if you sighted right over the point of the prow. These are
the five possible answers to the item. This is the correct answer. It shows that
the prow of the boat has dropped below the aiming point. This is the same aiming
point shown above. Note that the prow of the motor boat has dropped below it.

Part VI. Spatial Visualization
SAMPLE ITEM: TURN RIGHT TOWARD FACE 90°
To "turn" the clock means to swing it around . . . top to bottom of the clock.
Definitions of "tilt" and . . . are likewise given and illustrated. "Right" means
the right side of the clock, near 9 o'clock; the left side is near 3 o'clock. This
is always marked, in the drawings, by the letters R and L. The correct answer, B,
has been marked in the sample.

Fig. 71. Sample Items from Guilford-Zimmerman Aptitude Survey. (Reproduced by
permission of Sheridan Supply Company.)


The Holzinger-Crowder Uni-Factor Tests (27) have aimed more deliberately than most
other batteries toward factorial purity and independence of factor scores. Partly
as a consequence of these aims, these tests measure rather simple and narrowly
limited functions. The battery provides four scores: Verbal, Spatial, Numerical,
and Reasoning. Each of the first three scores is derived from two tests; the
fourth from three tests. The battery is technically sound and the manual is
unusually thorough and objective in its presentation of data. Norms are based on a
large, nationwide sample of junior and senior high school students. Regression
equations are provided for combining the factor scores so as to predict general
scholastic aptitude, as well as achievement in science, social studies, English,
and mathematics (cf. Ch. 7, p. 173). Validity has been checked principally against
concurrent criteria of achievement tests and school grades. Some promising
evidence of differential validity is available.

A relatively recent addition to differential testing is the MAT (33). Employing
nine tests, this battery yields scores in Verbal Comprehension, Perceptual Speed,
Numerical Reasoning, and Spatial Visualization. Similar in purpose and approach to
the DAT, this battery has not been in use long enough to permit an adequate
judgment of its effectiveness. The tests have been well constructed and carefully
evaluated. Good techniques for score reporting have been worked out, including
provision for evaluating the significance of differences between factor scores
obtained by the same individual. Evidence of differential validity against grades
in specific courses is inconclusive.

In commenting upon multiple aptitude batteries as a group, we may note first that
available instruments differ considerably in approach, technical quality, and
amount of available evaluative data. A common feature, however, is their
disappointing performance with regard to differential validity. In this
connection, the student is urged to reread the section on the use of tests for
classification decisions, given in Chapter 7. It will be noted that counseling,
for which most if not all the batteries discussed in the present section were
developed, is essentially a classification decision. And it will be recalled that
differential validity is the major requirement for classification batteries.

Perhaps academic criteria, against which most multiple aptitude batteries have so
far been validated, are not clearly differentiable on the basis of aptitudes. It
is possible that differences in performance in specific courses depend principally
on interests, motivation, and emotional factors. Unpredictable contingency
factors, such as interpersonal relations between an individual student and a
particular instructor, may also play a part. With regard to aptitudes, the large
common contribution of verbal comprehension to achievement in all academic areas
has been repeatedly demonstrated. If dependable criteria of occupational success
(rather than performance in vocational training) can be developed, it is likely
that better differential predictions can be made with batteries designed for this
purpose. In terms of available data, however, multi-factor batteries have fallen
short of their initial promise.


MULTIPLE APTITUDE BATTERIES FOR SPECIAL PROGRAMS 


Factor analysis underlies the development of classification batteries widely
employed in the armed services and in certain civilian agencies. The General
Aptitude Test Battery (GATB) was developed by the United States Employment Service
for use by employment counselors in the State Employment Service offices (14, 22).
Prior to the preparation of this battery, factorial analyses were made of
batteries of 15 to 29 tests, which had been administered to nine groups. These
groups included a total of 2156 men between the ages of 17 and 39, most of whom
were trainees in vocational courses. A total of 59 tests was covered by the nine
overlapping batteries. On the basis of these investigations, 10 factors were
identified and 15 tests chosen to measure them. In a later revision of the GATB,
the number of tests was reduced to 12 and the number of factors to 9.

The factors covered by the GATB are as follows:

G. Intelligence: Found by adding the scores on three tests also used to measure
other factors (Vocabulary, Arithmetic Reason, Three-Dimensional Space).

V. Verbal Aptitude: Measured by a Vocabulary test requiring examinee to indicate
which two words in each set have either the same or opposite meaning.

N. Numerical Aptitude: Includes both Computation and Arithmetic Reason tests.

S. Spatial Aptitude: Measured by Three-Dimensional Space test, involving the
ability to comprehend two-dimensional representation of three-dimensional objects
and to visualize effects of movement in three dimensions.

P. Form Perception: Measured by two tests requiring the subject to match identical
drawings of tools in one test and of geometric forms in the other.

Q. Clerical Perception: Similar to P, but requiring the matching of names rather
than pictures or forms.

K. Motor Coordination: Measured by a simple paper-and-pencil test requiring the
subject to make specified pencil marks in a series of squares.

F. Finger Dexterity: Two tests requiring the assembling and disassembling,
respectively, of rivets and washers.

M. Manual Dexterity: Two tests requiring the subject to transfer and reverse pegs
in a board.


The last four tests (for measuring factors F and M) require simple appara- 
tus; the other eight are paper-and-pencil tests. Alternative forms are avail- 
able for the first seven tests. The entire battery requires approximately two 
and one-quarter hours. 

The nine factor scores on the GATB are converted into standard scores with a mean
of 100 and an SD of 20. These standard score norms were derived from a sample of
4000 cases representative of the 1940 working population of the United States in
terms of age, sex, educational, occupational, and geographical distribution. By
testing many groups of employees, applicants, and trainees in different kinds of
jobs, Occupational Ability Patterns (OAP) have subsequently been established,
showing the critical aptitudes and minimum standard scores required for each
occupation. For example, accounting was found to require a minimum score of 105 in
Intelligence (G) and 115 in Numerical Aptitude (N). Plumbing called for a minimum
score of 85 in Intelligence (G) and of 80 in Numerical Aptitude (N), Spatial
Aptitude (S), and Manual Dexterity (M). An individual's standard score profile is
matched with all those OAP's whose cutoff scores he reaches or exceeds. The
occupations classified under these OAP's are then considered in counseling him.
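
The matching of a profile against OAP cutoffs may be illustrated by a brief
sketch. The two sets of minimum scores are those cited above for accounting and
plumbing; the examinee's profile, the function name, and the data layout are
hypothetical and serve only to show the multiple-cutoff logic, not the procedure
as actually carried out in Employment Service offices.

```python
# Minimum standard scores for two OAP's, taken from the examples in the text.
OAP_CUTOFFS = {
    "accounting": {"G": 105, "N": 115},
    "plumbing": {"G": 85, "N": 80, "S": 80, "M": 80},
}

def matching_oaps(profile, cutoffs=OAP_CUTOFFS):
    """Return the OAP's whose every cutoff the profile reaches or exceeds."""
    return [
        name
        for name, minimums in cutoffs.items()
        if all(profile.get(factor, 0) >= score for factor, score in minimums.items())
    ]

# Hypothetical examinee profile (standard scores, mean 100, SD 20).
examinee = {"G": 98, "V": 102, "N": 90, "S": 110, "M": 95}
print(matching_oaps(examinee))
```

This examinee falls below the accounting cutoffs in both G and N but meets all
four plumbing cutoffs, so that only the occupations classified under the latter
pattern would be considered. A high score on one factor cannot compensate for a
score below the cutoff on another.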

The development of an OAP was illustrated in Chapter 7 (pp. 175-176). The
procedure followed with each group includes job analysis, selection of suitable
criterion data (output records, supervisors' ratings, training performance, etc.),
and administration of the 12-test battery. The significant factors are chosen on
the basis of their criterion correlations, as well as the means and SD's of scores
on each factor and the qualitative job-analysis information. For example, if
workers on the job under consideration average considerably above the normative
sample in a particular factor and also show relatively low variability in their
scores, that factor would probably be included in the OAP even if it fails to show
a significant criterion correlation. Such a situation could occur if employees on
a certain job were a highly selected group with regard to that aptitude. Specific
occupations are grouped together on the basis of similar OAP's. Thus 22 OAP's have
so far been developed, covering over 500 occupations. Slightly less than half of
these occupations have been directly studied by the procedure outlined above. The
others were added on the basis of similarity in job duties.

The GATB is used in State Employment Service offices in the counseling and job
referral of nearly half a million applicants a year. In addition, the battery may
be obtained by non-profit organizations such as colleges, universities, Veterans'
Administration hospitals, and prisons, under arrangements with the appropriate
State Employment Service. Permission has also been granted to individuals and
organizations in many foreign countries to translate the GATB and use it for
research purposes. Foreign editions are available or in preparation in over twenty
countries distributed over every continent (13).

Through the facilities of the United States Employment Service and the various
State offices, a vast body of data has been gathered on the GATB, and a continuing
research program is in progress. The normative sample is unusually large and
representative. A large amount of information on predictive and concurrent
validity of individual OAP's is reported in the manual (22) and in the Validity
Information Exchange published in Personnel Psychology since 1954. Although
cross-validation has not been systematically carried out and rather crude
statistical procedures have been employed, available cross-validation data are
promising. Despite the shortness of the tests, reliabilities of the factor scores
are generally satisfactory. Both equivalent-form and retest correlations cluster
in the .80's and low .90's, although the reliabilities of the motor tests tend to
be somewhat lower.

Certain limitations of the battery should be noted. All tests are highly speeded.
Coverage of aptitudes is somewhat limited. No mechanical comprehension test is
included, nor are tests of reasoning and inventiveness well represented. The
factorial structure of the battery rests on a series of early exploratory studies.
A more comprehensive investigation with a large sample and a wider variety of
tests would provide a more solid foundation. In terms of the over-all empirical
evidence, however, the GATB has proved to be one of the most successful multiple
aptitude batteries in current use.

The armed services also make extensive use of multiple aptitude batteries. These
batteries are given for classification purposes after preliminary screening with
such instruments as the Army General Classification Test (AGCT) and the Armed
Forces Qualification Test (AFQT) (cf. Ch. 9). Although the Air Force pioneered in
the development of classification batteries, all branches of the armed services
eventually prepared multi-factor batteries for assigning personnel to specialized
military jobs.

In the Air Force, the Aircrew Classification Battery was developed and
employed during World War II for selecting pilots, navigators, bombardiers,
and other flight personnel (47). Later, an Airman Classification Test Bat- 
tery (48) was prepared for use with other Air Force personnel. Both batteries 
were constructed by means of factor analysis. In applying these batteries, the 
scores on each test are substituted in the appropriate regression equation 
for the particular specialty. The individual's predicted score, or Aptitude 
Index, for that specialty serves as a basis for job assignments. For ex- 
ample, in computing the Aptitude Index for clerical specialties, significant 
weights are given to tests of arithmetic reasoning, background for current 
affairs, dial- and table-reading, numerical operations, and word knowledge. 
The index for radio operator is based upon tests of arithmetic reasoning, 
dial- and table-reading, electrical information, memory for landmarks, and 
numerical operations. A score on a biographical inventory, derived with a 
different key for each specialty, is also employed in the determination of each 
Aptitude Index. 
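
The computation of an Aptitude Index from a regression equation may be sketched as
follows. The test names follow the clerical example given above, but the numerical
weights, the scores, and the function name are invented for illustration; they are
not the weights actually employed by the Air Force. The biographical inventory
score is omitted for simplicity.

```python
def aptitude_index(scores, weights, intercept=0.0):
    """Predicted criterion score: intercept plus the weighted sum of test scores."""
    return intercept + sum(weights[test] * scores[test] for test in weights)

# Hypothetical regression weights for a clerical specialty; the test names follow
# the text, but the numerical weights are invented for illustration.
clerical_weights = {
    "arithmetic_reasoning": 0.30,
    "current_affairs": 0.10,
    "dial_table_reading": 0.20,
    "numerical_operations": 0.25,
    "word_knowledge": 0.15,
}

# Hypothetical standard scores for one airman.
airman = {
    "arithmetic_reasoning": 60,
    "current_affairs": 50,
    "dial_table_reading": 55,
    "numerical_operations": 65,
    "word_knowledge": 45,
    "electrical_information": 40,  # not weighted for this specialty
}

print(round(aptitude_index(airman, clerical_weights), 2))
```

The same set of test scores would be entered into a different equation for each
specialty, each with its own weights, and the resulting indices compared in making
the assignment.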

On the whole, both the GATB and the various multi-factor batteries developed for
military use have proved relatively successful as classification instruments. Both
differ from the general counseling batteries discussed in the preceding section in
that they have been validated chiefly against occupational rather than academic
criteria. To be sure, training criteria have sometimes been substituted for actual
job performance, but the training programs were job-oriented and quite unlike
school work. Both training and job activities of airplane pilots, bakers,
beauticians, and the many other kinds of workers included in these testing
programs are far removed from traditional academic tasks. With criteria differing
more widely from each other and from the verbally loaded academic criterion, there
is more room for differential validity.

Another advantage enjoyed by the military batteries is that the number of
occupational fields to be covered is smaller than in a general counseling battery.
This was especially true of the Aircrew Classification Battery, whose task was
limited to the assignment of personnel to four or five jobs. As a result, it was
possible to work with relatively narrow group factors, which specifically matched
the criterion requirements. In the Air Force research, for example, a large number
of different sensorimotor, perceptual, and spatial factors were identified and
utilized in test construction. A general counseling battery, on the other hand,
must concentrate on a few broad group factors, each of which is common to many
jobs. To do otherwise would require the administration of a prohibitive number of
tests to each person. But with such an instrument distinctions are blurred and
differential validity drops.

CHAPTER 14


Special Aptitude Tests: I 


Even prior to the development of multiple aptitude batteries, it was generally
recognized that intelligence tests were limited in their coverage of abilities.
Efforts were soon made to fill the major gaps by means of special aptitude tests.
Among the earliest were those designed to measure mechanical aptitude. Since
intelligence tests concentrate chiefly upon "abstract" functions involving the use
of verbal or numerical symbols, a particular need was felt for tests covering the
more "concrete" or "practical" abilities. Mechanical aptitude tests were developed
partly to meet this need.

The demands of vocational selection and counseling likewise stimulated the
development of tests to measure mechanical, clerical, musical, and artistic
aptitudes. Tests of vision, hearing, and motor dexterity have also found their
principal applications in the selection and classification of personnel for
industrial and military purposes. It is thus apparent that a strong impetus to the
construction of all special aptitude tests has been provided by the urgent
problems of matching job requirements with the specific pattern of abilities
characterizing each individual.

A word should be added about the concept of special aptitudes. The term originated
at a time when the major emphasis in testing was placed upon general intelligence.
Mechanical, musical, and other special aptitudes were thus regarded as
supplementary to the "IQ" in the description of the individual. With the advent of
factor analysis, however, it was gradually recognized that intelligence itself
comprises a number of relatively independent aptitudes, such as verbal
comprehension, numerical reasoning, numerical computation, spatial visualization,
associative memory, and the like. Moreover, several of the traditional special
aptitudes, such as mechanical and clerical, are now incorporated in some of the
multiple aptitude batteries.

What, then, is the justification for a separate chapter on special aptitude tests?
First, there are certain areas, such as vision, hearing, motor dexterity, and
artistic talents, that are rarely included in multiple aptitude batteries. From
the practical viewpoint of test administration, it is probably more convenient to
employ independent tests in these fields. Such a practice permits more
flexibility, not only in the choice of relevant functions, but also in the
fullness with which each function is to be measured for specific purposes. The
second reason for a separate discussion of special aptitude tests pertains to
those tests that do overlap the content of multiple aptitude batteries. The
various clerical and mechanical aptitude tests fall into this category. In certain
types of testing programs, it is still current practice to employ tests of general
intelligence as screening instruments and to supplement them with more detailed
special aptitude tests in relevant areas. In part, the continuation of this
procedure may reflect inertia. But to some extent it undoubtedly results from the
availability of vocational norms and validation data for the special aptitude
tests. Such data are not yet generally available for the subtests of the various
differential batteries, which have been more recently developed.

This chapter covers sensory, motor, mechanical, and clerical tests. In the
following chapter will be considered tests for measuring aptitudes in the arts, as
well as tests of reasoning and creativity. Some of the tests designed to predict
professional aptitudes, to be considered in Chapter 17, could likewise be included
under the measurement of special aptitudes. It is apparent that the designation
"special aptitude tests" has become somewhat of a catch-all. In it are included a
miscellaneous collection of tests, each measuring a more narrowly defined area
than either intelligence tests or multiple aptitude batteries.

that the measurement of sensory capacities has served a variety of functions
in the field of psychological testing. The reader will recall the early attempts 
by Galton and others to measure intelligence by means of sensory tests (Ch. 
1). Although these efforts proved futile, a number of later studies on school 
children have suggested the detrimental effects that visual or auditory handi- 
caps may have upon intellectual development, educational progress, and so- 
cial adjustment (cf. 1, pp. 142-147). The examination of school children for 
the detection of minor visual or auditory deficiencies is now routine practice 
in many school systems. These examinations serve a screening function. The 
children whose test performance shows evidence of defect are then referred 
for individual examination and clinical diagnosis by a specialist. The results 
of such clinical examinations may serve as a basis for corrective treatment, 
assignment to special classes, transfer to special schools, or other appropriate 
action. Screening tests for visual or auditory defects may also be utilized as a 
check prior to the administration of any group tests that require reading or 
oral instructions. A child who falls below the minimum standard on a sensory 
test can then be excluded from the regular testing session and examined by 
more suitable procedures. 

Another application of sensory tests is to be found in the psychological clinic.
It is frequently desirable to check for unsuspected sensory deficiencies as a
possible source of the patient's difficulties. This is especially important in
cases of reading disabilities and speech defects. But many other conditions, such
as behavior disorders or school retardation in children, and depression or
abnormal suspiciousness in adults, may have been induced or aggravated by an
uncorrected sensory deficiency (cf., e.g., 4, 12).

One of the chief current uses of sensory tests is in the selection of industrial
or military personnel. A large amount of research is available that indicates the
effects of sensory handicaps upon quantity and quality of output, spoilage and
waste of materials, job turnover, and accidents (cf., e.g., 59, Chs. 5 and 14).
Many types of military specialties likewise make heavy demands upon visual or
auditory capacities (cf. 10, 60). Special attention has been given to the role of
both auditory and visual defects in the causation of accidents. Relevant data have
been obtained, not only among industrial employees and transportation workers, but
also among automobile drivers. The psychology of traffic thus represents a related
field in which increasing use of sensory tests is being made (cf. 15).

Although psychological research on sensory capacities has extended to all sense
modalities, standardized tests for the measurement of individual differences have
been limited primarily to vision and hearing. These, of course, are the most
important modalities for modern man. It is also interesting to note that most
available sensory tests are designed chiefly to detect deficiencies. The
identification of superior capacities has received relatively little attention.
Perhaps the medical point of view, which is concerned with pathological deviations
rather than with the whole range of human variation, has influenced the
orientation of sensory tests in this direction.

In the present section, we shall consider . . . standardized testing techniques in
the field . . . screening instruments designed for general . . . mention will be
made of the more elaborate . . .
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the subject correctly identifies the letters; the denominator, the distance at 
which the average person identifies them. If expressed as decimals or percent-
ages, the two types of ratios are identical. In the example cited, both are equal
to .50. 

In certain measures of acuity, especially those obtained under more pre- 
cise laboratory conditions, acuity is reported in terms of the visual angle 
subtended by the smallest object the person can see. As an object of constant 
size moves farther away, it will, of course, subtend a smaller visual angle 
and will become increasingly difficult to see. A large, distant object may 
subtend the same visual angle as a small object close at hand. The most com-
mon test object employed in determining the visual angle of the smallest
discriminable detail is known as the “Landolt C” or “Landolt ring.” This is
a small circle with a break at one point. Both the size and position of the gap 
are varied in the course of the examination. In each case, the subject is re- 
quired to state the position of the break. The object of the test is to find the 
smallest gap that the subject can correctly locate. If the subject is seated at a 
standard distance from the test object, the visual angle subtended by this gap
can be readily determined. 
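
The geometry behind this determination can be sketched in a few lines of Python. This is only an illustrative sketch: the function name and the sample sizes and distances are assumptions introduced here, not values from any published test manual. It uses the standard trigonometric relation between object size, viewing distance, and visual angle.

```python
import math


def visual_angle_arcmin(size, distance):
    """Visual angle, in minutes of arc, subtended by an object of the given
    size viewed at the given distance (both expressed in the same unit)."""
    if size <= 0 or distance <= 0:
        raise ValueError("size and distance must be positive")
    angle_rad = 2.0 * math.atan(size / (2.0 * distance))
    return math.degrees(angle_rad) * 60.0


# Hypothetical illustration: a small object close at hand and an object ten
# times larger at ten times the distance subtend the same visual angle.
near = visual_angle_arcmin(size=1.0, distance=100.0)
far = visual_angle_arcmin(size=10.0, distance=1000.0)

# Moving an object of constant size farther away reduces the visual angle.
farther = visual_angle_arcmin(size=1.0, distance=200.0)
```

With the subject at a fixed distance, the only quantity that varies from trial to trial is the size of the gap, so the smallest gap correctly located translates directly into a visual angle.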

In general, it has been found that the average person can just barely see 
an object (or detail, such as the above-mentioned break) that subtends a 
visual angle of one minute (i.e., 1/60th of a degree). This angle corresponds
to a visual acuity of 20/20 as determined by the Snellen Chart or similar 
tests. Because of the relationship between the two types of tests, visual acuity 
can be readily converted into a visual angle, or vice versa. The visual angle, 
in minutes, is simply the reciprocal of the visual acuity ratio. For example,
20/40 acuity corresponds to a visual angle of 2 minutes; 20/100, to an angle
of 5 minutes; and 20/10, to an angle of 1/2 minute.
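
The reciprocal relation just described can be checked with a short Python sketch. The function names are hypothetical and serve only to illustrate the arithmetic; exact fractions are used so that the results match the figures quoted above.

```python
from fractions import Fraction


def snellen_to_visual_angle(test_distance, normal_distance):
    """Visual angle in minutes of arc for a Snellen ratio such as 20/40.

    The angle is taken as the reciprocal of the acuity ratio, so 20/20
    corresponds to 1 minute.
    """
    acuity = Fraction(test_distance, normal_distance)
    return 1 / acuity


def visual_angle_to_decimal_acuity(angle_minutes):
    """Acuity ratio, as an exact fraction, for a visual angle in minutes."""
    return 1 / Fraction(angle_minutes)


angle_20_40 = snellen_to_visual_angle(20, 40)    # 2 minutes
angle_20_100 = snellen_to_visual_angle(20, 100)  # 5 minutes
angle_20_10 = snellen_to_visual_angle(20, 10)    # 1/2 minute
```

The conversion runs equally well in the other direction: a visual angle of 2 minutes gives an acuity ratio of 1/2, or .50 in decimal form.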

The above discussion has dealt only with far acuity tests. What of the 
other important aspects of vision? Measures of a number of different visual
characteristics have been incorporated in certain visual screening instruments
specially designed for large-scale testing in industry or in schools. The three
best-known instruments for this purpose are the Ortho-Rater (Bausch and
Lomb), the Sight-Screener (American Optical Company), and the Telebi-
nocular (Keystone View Company). A picture of the Ortho-Rater in use is
reproduced in Figure 72. The three instruments are essentially similar in
principle, each providing measures of near and far acuity, depth perception,
lateral and vertical phorias, and color discrimination. 

These instruments have been used extensively in industry, where efforts 
have been made to set up visual qualifications for specific jobs. Some of the 
results obtained in a survey of 3025 workers in 51 industrial jobs can be

Fig. 72. The Bausch and Lomb Ortho-Rater for Testing Visual Functions. (Courtesy
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sults are reported separately for 16 jobs requiring predominantly “near” vi- 
sion, and for 35 jobs requiring varying combinations of "near" and "far" 
vision. 

A thorough investigation of 14 types of wall charts, together with the three 
machine tests described above, was conducted by the Personnel Research 
Branch of The Adjutant General's Office, United States Army (63). In in- 
troducing the study, the authors pointed out that great inconsistency in the 
application of visual standards for military service had resulted from the 
use of different instruments, as well as from inadequate standardization of 
testing and scoring procedures. One part of the study was concerned with 
the reliability of wall charts. For this purpose, 261 men were retested with 
each chart after a 24-hour interval. On the whole, the reliabilities were 
found to be satisfactory. The Snellen Chart, for example, yielded a relia- 
bility of .88. Most of the others had reliabilities close to .80. In another 
part of the study, similar reliability coefficients were found for most of the
measures obtained with the three machine tests. 

A major portion of the investigation consisted of factor analyses of the in- 
tercorrelations among the various tests. The most prominent factor revealed 
by these analyses was a retinal resolution factor. Retinal resolution, or the 
ability to "resolve" or distinguish points in the visual field, is essentially what 
is meant by visual acuity. All tests had significant loadings in this factor, 
but its loadings were highest in the far acuity tests. The purest measures of 
this factor were those employing, not letters, but a checkerboard pattern, 
such as that illustrated in Figure 74. This type of pattern is also used in the
Ortho-Rater far acuity test. Letter charts, such as the Snellen Chart, had 
additional significant loadings in a form perception factor. A further complica- 
tion in the development of letter tests arises from the inequalities in ease of 
recognizing different letters. Such a problem is avoided in the checkerboard 
tests. 

All near tests had significant loadings in a factor of accommodation. This 
appears to be the principal distinguishing characteristic of near vision. It is 
this factor that shows impairment with advancing age. Among the other 
factors identified were depth perception, lateral phoria, vertical phoria, and 
convergence efficiency (at the normal reading distance). It should be noted
that all tests measuring these factors also had appreciable loadings in other
factors. These non-pertinent factors included the previously mentioned retinal
resolution factor, which occurred in all the tests, as well as minor group and
specific factors. Moreover, the factorial composition of similarly named tests
from the different machines varied considerably. It may be added that the 
factorial analyses likewise suggested the presence of certain brightness dis- 
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widely from the prescribed standard. Either varying illumination or the fad- 
ing or soiling of test materials may affect different hues unequally and hence 
alter the relationship between colored stimuli. Thus despite careful selection 
of original colors in the construction and printing of the test, a color-blind 
person might pass such a test because of poorly controlled conditions. 

The Illuminant-Stable (I-S) Color Vision Test (28, 29) was designed
especially to meet these difficulties. The author states that the particular
colors chosen are such as to yield approximately the same results under
widely varying conditions of illumination. Some empirical evidence of such
stability of scores is reported from a study of 100 normal and 100 color-blind
subjects (29). To reduce soiling and fadin 
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isochromatic plates, as in the Ishihara and Dvorine tests. A definitive evalu- 
ation of the I-S test is not possible until more data have been accumulated. 

Hearing. Like vision, hearing is not a unitary capacity. An individual may
be normal or superior in one aspect of hearing and seriously deficient in an-
other. The aspect of most general interest is auditory acuity. Also known
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measurement of musical aptitudes. At least some of these tests, however, 
could have been included in this section. This is especially true of the pitch 
discrimination test, which has been widely applied outside the field of 
music. 

The remainder of the section will provide a rapid overview of typical 
techniques for the measurement of auditory acuity. For a fuller discussion of 
these techniques, the reader is referred to such books as Davis (13), Hirsh 
(35), and Watson and Tolan (64). Among the simplest procedures em- 
ployed for testing auditory acuity are the whispered speech and watch tick 
tests. In both of these tests, the examiner gradually increases or decreases his 
distance from the subject in order to determine at what distance the subject 
can just barely hear the spoken words or the watch. While the whispered 
speech test checks acuity over the pitch range usually necessary to under-
stand speech, the watch test samples higher frequencies. The latter thus pro-
vides a useful supplement, since loss of hearing often begins with decreased
sensitivity for higher frequencies. 

The chief weakness of these tests, of course, is the lack of standardization 
of both stimuli and surroundings. The amount of distracting noise, echoes, 
and other acoustic properties of the room influence the results. Similarly, 
the voices and speech of different examiners and the ticking of different 
watches are far from uniform. It is probably for these reasons that when
using such crude tests, each examiner generally applies his own norms. Some 
degree of uniformity is at least possible within each examiner's application of 
the technique. 

A more precise determination of auditory acuity can be obtained with 
one of the many types of electronic audiometers that are now available.
For individual testing, pure tone audiometers are generally employed. In ad-
ministering such a test, it is customary to test one ear at a time, the subject
receiving the sound through a headphone or receiver held against the ear.
Beginning with a sound too faint for the subject to hear, the examiner
gradually increases the intensity of the tone until the subject indicates that
he hears it. The threshold is determined in both an ascending and descend-
ing direction. In other words, the intensity is increased until the subject can
just barely hear the sound, and it is decreased until he can no longer hear it. 
This will be recognized as the psychophysical method of limits commonly 
employed in experimental psychology laboratories. The entire procedure is 
repeated at several frequency levels in order to check for differential hearing 
loss.
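
The logic of the ascending and descending series can be sketched in Python as follows. The sketch rests on several assumptions introduced here for illustration: the intensity steps, the subject's responses, the use of the midpoint between adjacent steps as the crossover point, and the averaging of the two series into a single estimate. It is not a description of any particular audiometer's procedure.

```python
def ascending_threshold(levels_db, heard):
    """Midpoint between the last unheard and the first heard level."""
    for i, was_heard in enumerate(heard):
        if was_heard:
            if i == 0:
                raise ValueError("ascending series must start below threshold")
            return (levels_db[i - 1] + levels_db[i]) / 2.0
    raise ValueError("tone was never heard in the ascending series")


def descending_threshold(levels_db, heard):
    """Midpoint between the last heard and the first unheard level."""
    for i, was_heard in enumerate(heard):
        if not was_heard:
            if i == 0:
                raise ValueError("descending series must start above threshold")
            return (levels_db[i - 1] + levels_db[i]) / 2.0
    raise ValueError("tone was always heard in the descending series")


def method_of_limits(ascending, descending):
    """Average the crossover points of one ascending and one descending
    series. Each argument is a (levels_db, heard) pair."""
    up = ascending_threshold(*ascending)
    down = descending_threshold(*descending)
    return (up + down) / 2.0


# Hypothetical responses at a single frequency level.
ascending = ([0, 5, 10, 15, 20], [False, False, False, True, True])
descending = ([30, 25, 20, 15, 10, 5], [True, True, True, True, True, False])
threshold = method_of_limits(ascending, descending)
```

In this made-up example the ascending series crosses at 12.5 and the descending series at 7.5, giving an estimate of 10.0. Repeating the same computation at each frequency level yields the values needed to check for differential hearing loss.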

At each frequency level, the subject’s hearing loss in decibels can be
read directly from the audiometer dial. The zero point on this dial represents
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the intensity of sound that the “normal” ear can just barely hear on that 
audiometer. This point was empirically determined in the process of cali- 
brating each instrument, by testing a large normative sampling of persons. 
The subject’s hearing loss, as indicated on the audiometer, represents the 
number of decibels by which sound intensity must be increased above the 
normal threshold in order to be audible to him. The audiometer dial readings 
are used to plot the subject’s audiogram, a graph showing his hearing loss at 
different frequency levels. An audiogram of a school child is reproduced in 
Figure 75. It illustrates differential hearing loss at the higher frequencies in a 
subject whose hearing is virtually normal at lower frequencies.
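
The relation between the normal threshold, the subject's threshold, and the plotted hearing loss can be illustrated with a brief Python sketch. All numbers below are hypothetical and are not taken from Figure 75; the function names are likewise assumptions made for the illustration.

```python
def hearing_loss_db(subject_threshold_db, normal_threshold_db):
    """Decibels by which intensity must exceed the normal threshold
    before the tone becomes audible to the subject."""
    return subject_threshold_db - normal_threshold_db


def audiogram(subject_thresholds, normal_thresholds):
    """Hearing loss at each tested frequency (Hz), in ascending order."""
    return {
        freq: hearing_loss_db(subject_thresholds[freq], normal_thresholds[freq])
        for freq in sorted(subject_thresholds)
    }


# Hypothetical thresholds in decibels on an arbitrary common scale.
normal = {250: 10, 1000: 5, 4000: 7}
subject = {250: 12, 1000: 10, 4000: 92}
losses = audiogram(subject, normal)
```

Here the computed losses are small at 250 and 1000 cycles but large at 4000 cycles, the same pattern of differential loss at the higher frequencies that an audiogram is meant to reveal.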
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be expected, the reliability of informal, unstandardized techniques such as 
the whispered speech and watch tick tests is much lower. Moreover, the 
latter tests do not correlate very highly with audiometer tests. The median 
correlations found in one survey were .34 and .35 for the whispered speech 
and watch tick tests, respectively (cf. 60). 

Since one of the most important functions of hearing is the understanding 
of speech, a more direct measure of this ability may be desirable. Speech
audiometers are sometimes employed for this purpose. In this case, the sound
stimulus the subject receives through the earphone is the human voice pro-
nouncing numbers, words, or sentences. As in the pure tone audiometer, the
intensity is varied to determine the point at which the subject can correctly
understand speech. This intensity is usually higher than that at which the
subject can just barely hear speech. Such a procedure provides a more
functional measure of the social adequacy of the subject's hearing. 

Other types of audiometers are available for group screening tests. These
audiometers are often used in surveys of school children, as many as 40 chil-
dren being tested simultaneously by this method. For many years, group
audiometer tests have been conducted with phonographic-type audiometers
in which speech provides the test stimuli. The most common procedure is
that involving the so-called fading numbers test. In this test, two-place num-
bers are presented at decreasing intensities, the subjects being required to
write the numbers as they hear them. No attempt is made to explore differ-
ential hearing loss at different frequency levels. That such a procedure may
provide misleading results is illustrated by the case whose audiogram was
reproduced in Figure 75. Despite the severe hearing loss at higher frequen-
cies, this child had been classified as normal in a group audiometer fading
numbers test (64, p. 247). The hearing loss indicated by that test was only
3 decibels, in contrast to the losses of from 80 to 90 decibels revealed by the
more thorough examination.

More recently, pure tone audiometers have been adapted for group use 
(cf. 36). Equipment is now available that permits the utilization of the same
set of earphones with either a speech phonographic audiometer or a pure
tone audiometer. Pure tone group tests are scored as either passed or failed;
they are not designed to yield measures of auditory thresholds. The stimulus
is a signal tone presented at each of three frequency levels, six trials being
given at each level. The subject merely underlines yes or no next to the
appropriate trial number to indicate whether or not he hears the signal. In
some of the trials, no signal is actually given, the position of these trials being
specified on the examiner's master sheets. An illustration of a pure tone
group audiometer in use is given in Figure 76. This type of test shows a
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corporated into industrial and military batteries, they are not generally 
available as standardized tests with broadly applicable norms. They repre- 
sent a rich resource of raw test material for the research worker, but they 
are not a ready-made tool for the test user. 


Fig. 77. Complex Coordination Test Used in Air Force Classification Program.
(Courtesy U. S. Air Force; for description of test, cf. Melton, 44.)


A comprehensive survey of commercially available, standardized tests for 
motor and mechanical aptitudes appearing prior to 1942 is provided by 
Bennett and Cruikshank (7). Data on the industrial validity of the tests 
cited in this and later sections of this chapter can be found in Dorcus and 
Jones (16), as well as in the Mental Measurements Yearbooks and the test 
manuals. Also relevant is Patterson’s (50) survey of all published results on 
the validity of tests in predicting success in trade and vocational schools. 

Two tests measuring simple hand and finger movements required by cer-
tain routine assembly jobs are the O'Connor Finger Dexterity and Tweezer
Dexterity Tests (45). These tests measure the speed with which the subject 
can insert pins into small holes by hand or with a tweezer, respectively. In 
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the former, three pins are placed into each hole; in the latter, the holes are 
smaller, and a single pin is inserted in each. Although the reliability of 
these tests seems to be satisfactory, their validity needs to be carefully 
checked in terms of specific criteria. The tests are probably fairly valid for 
selecting individuals for jobs requiring the manipulation of small objects 
with fingers or tweezers. But they may show little or no relationship to other 
types of manipulative ability. 

The Crawford Small Parts Dexterity Test (11), shown in Figure 78, cov-
ers a wider variety of manipulative skills. In Part I of this test, the subject
uses tweezers to insert pins in close-fitting holes, and then places a small
collar over each pin. In Part II, small screws are placed in threaded holes
and screwed down with a screwdriver. The score is the time required to
complete each part. Split-half reliability coefficients between .80 and .95 are
reported for the two parts of this test. Despite the apparent similarity of the
functions required by Parts I and II, the correlations between the two parts
ranged from .10 to .50 in several industrial and high school samples, with a
median correlation of .42.


Fig. 78. Crawford Small Parts Dexterity Test. (Courtesy The Psychological Corpora-
tion.)


Another manual dexterity test which, however, utilizes no tools is the 
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Purdue Pegboard (57, 58). This test is said to provide a measure of two 
types of activity, one requiring gross movements of hands, fingers, and arms, 
and the other involving "tip of the finger" dexterity needed in small assem- 
bly work. First, pins are inserted individually in small holes with the right 
hand, left hand, and both hands together, in successive trials. In another 
part of the test, pins, collars, and washers are assembled in each hole. The 
prescribed procedure for this activity involves the simultaneous use of both 
hands. 

Two tests requiring a similar type of gross manual dexterity are the 
Minnesota Rate of Manipulation Test (65) and the Stromberg Dexterity 
Test (56). The former consists of a board containing 60 circular holes into 
which 60 cylindrical blocks are to be placed. In the second part of this test, 
each block is removed from the board, turned over with the other hand, and 
returned to its hole. It is interesting to note that a correlation of only .57 was 
found between the “placing” and “turning” parts of this test. 

The Stromberg Dexterity Test is illustrated in Figure 79. Red, yellow, and 
blue blocks are to be inserted in a prescribed sequence in the correspondingly 
colored sections of the board. Before each trial, the blocks are arranged in 
standard order, with only one color to a row in one trial, and only one 
color to a column in another. The subject is required to pick up the blocks
in a specified pattern that prevents the placement of two blocks of the same
color in immediate succession.


Fig. 79. Stromberg Dexterity Test. (Courtesy The Psychological Corporation.) 
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As a final example, we may consider the Bennett Hand-Tool Dexterity 
Test (cf. Fig. 80). This test was designed “to provide a measure of pro- 
ficiency in using ordinary mechanics’ tools” (6). Although performance is un- 
doubtedly influenced by the subject’s past experience in handling tools, the 
test was constructed so as to maximize the role of manipulative skill rather 
than mechanical information. The task is simply to remove all the nuts and 
bolts (of different sizes) from the left upright and mount them on the right 
upright in a prescribed sequence. The score is the total time required to com- 
plete this task. 


Fig. 80. Bennett Hand-Tool Dexterity Test. (Courtesy The Psychological Corpora- 
tion.) 


Evaluation of Motor Tests. What can be said about the effectiveness of 
motor tests as a whole? The most important point to note in evaluating such
tests is the high degree of specificity of motor functions. Intercorrelations and
factor analyses of large numbers of motor tests have failed to reveal broad
group factors such as those found for intellectual functions. The most ex-
tensive factorial research on motor functions has been conducted with Air
Force data, principally by Fleishman (21, 22, 25, 26, 27, 44). Among the
major factors identified by Fleishman in a series of factorial analyses are
the following: 


Control Precision: Ability to make fine, highly controlled but not overcontrolled 
muscular adjustments—important in the rapid and accurate operation of con-
trols by hand, arm, and foot movements (as in Complex Coordination and Ro-
tary Pursuit Air Force tests).


Multi-Limb Coordination: Ability to coordinate gross movements requiring the 
simultaneous use of more than one limb in any combination.
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Response Orientation: Ability to select the appropriate response under highly 
speeded conditions—identified in complex coordination tests in which each pat- 
tern of signals requires a different choice of controls and direction of movement. 


Reaction Time: Speed with which an individual is able to respond to a stimulus 
when it appears—found to be independent of specific response required and of 
whether the stimulus is auditory or visual. 


Speed of Arm Movement: Speed with which gross arm movements can be made, 
regardless of precision.


Rate Control: Ability to make continuous anticipatory motor adjustments rela-
tive to changes in speed and direction of a moving target—the common factor in
pursuit and tracking tests.


Manual Dexterity: Ability to make skillful, well-controlled arm-hand movements 
in manipulating fairly large objects under speed conditions. 


Finger Dexterity: Ability to make skillful, controlled manipulations of small ob- 
jects, involving primarily finger movements. 


Arm-Hand Steadiness: Ability to make precise arm-hand positioning movements 
where strength and speed are minimized. 


Wrist-Finger Speed: Traditionally called "tapping," this ability is best measured
by paper-and-pencil tests requiring rapid tapping of the pencil in relatively large
areas.


Aiming: A narrowly defined ability measured chiefly by paper-and-pencil "dot- 
ting" tests which require subject to place a dot accurately and rapidly in each of 
a series of small circles.


Still other factors have been identified in the area of gross bodily move-
ments, as manifested in athletic skills (34). These include limb strength,
trunk strength, limb flexibility, trunk flexibility, energy mobilization (ability
to exert maximum energy at a given moment), static balance, dynamic bal-
ance, and gross body coordination (involving trunk and limbs). As new
data are gathered, this list of motor factors is constantly undergoing revision
and redefinition, but most of the factors named have been verified in several
independent investigations.

It has also been shown that the abilities called into play by motor tests
may change with practice (23, 26). In a study involving continued practice
on the Complex Coordination Test illustrated in Figure 77, both the number
and nature of factors identified at different stages of practice varied (26).
At the early stages, non-motor factors such as spatial orientation, visualiza-
tion, mechanical experience, and perceptual speed entered into performance
together with motor factors. With increasing practice, the importance of the
intellectual factors declined and that of the motor factors increased. In the
last stages of practice, the only common factors having significant weights
were speed of arm movement and control precision. A factor specific to the
Complex Coordination Test also increased in weight with practice. Similar
results were obtained in later analyses of other complex motor tests (23, 26).

The factorial composition of a motor test may also be affected by its 
difficulty level (24). This relation was investigated with a specially designed 
motor test requiring the subject to press a button on the response panel that 
corresponded to the position of a light appearing on the display panel. By 
varying the reference point on the display panel, the relation between stim- 
ulus lights and response buttons could be made more complicated and the 
difficulty of the task increased. Under the simplest conditions, individual 
differences in performance proved to be largely a function of perceptual 
speed. As the task was made more difficult, performance depended in-
creasingly on the spatial orientation and response orientation factors. The
finding that both practice and difficulty affect the factors determining per- 
formance on a motor task suggests that the validity of the same test for a 
given criterion may vary accordingly. Moreover, subjects of different ability
levels may solve the same task through the use of different processes. 

The change in nature of many motor tests with practice complicates the 
determination of reliability. Increasing the length of a motor test does not 
usually result in as great a rise in reliability as found with intellectual tests, 
since the different portions of the motor test may not measure quite the same 
functions. In general, the reliabilities of motor tests are not as high as those
of other types of tests, many falling in the .70's and .80's. Some of the
simple motor tests described in this section, however, have higher reliabili-
ties. In such simple tasks, there is less likelihood that practice will alter the
nature of the test. 
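
The rise in reliability ordinarily expected from lengthening a test is commonly estimated with the Spearman-Brown formula, which is not named in this passage and is brought in here only as an illustration. The following Python sketch applies it under the formula's own assumption that the added portions measure the same functions as the original test. The starting reliability of .70 is a hypothetical value chosen from the range mentioned above.

```python
def spearman_brown(reliability, length_factor):
    """Estimated reliability of a test lengthened by length_factor,
    assuming the added parts are equivalent to the original test."""
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical case: a test with reliability .70 is doubled in length.
doubled = spearman_brown(0.70, 2)   # about .82
tripled = spearman_brown(0.70, 3)   # about .88
```

When the later portions of a motor test call upon different functions than the earlier ones, the equivalence assumption fails, and the gain actually obtained can be expected to fall short of such estimates.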

In considering the validity of motor tests, we need to differentiate between 
complex motor tests that closely resemble the particular criterion perform-
ance they are trying to predict and tests of simple motor functions designed
for more general use. The former are well illustrated by some of the Air
Force tests. Such complex, custom-made tests that reproduce the combina-
tion of motor aptitudes required by the criterion have shown fair validity.
The Complex Coordination Test of the Air Force, for example, considerably
improved the prediction of performance in pilot training. For most purposes,
however, the use of such tests is not practicable, since a very large number
of tests would have to be devised to match different criteria. Moreover,
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there is a question as to whether the same result might not be achieved 
through a combination of intellectual tests and simple motor tests. For coun- 
seling purposes, of course, such highly specific tests would be of little use. 

With regard to commercially available motor tests, the functions they 
measure are very simple and their validities against most criteria are not 
high. For this reason, such tests can serve best as part of a selection battery, 
rather than as single predictors. In general, motor tests have been most 
successful in predicting performance on routine assembling and machine-oper- 
ating jobs (21, 31). As the jobs become less repetitive, perceptual and intel- 
lectual factors come to play a more important part. Because of the speci- 
ficity of motor functions, validity should be reported in terms of specific 
criteria and should be rechecked whenever the test is applied in a different 
situation. 

In this connection it should also be noted that different results may be 
obtained through the use of quality or quantity criteria, when the latter are 
applied to the same job. In one study on power sewing-machine operators 
(46), certain tests proved to be valid predictors of quality, while other tests 
gave high correlations with speed of work. In general, the motor tests in- 
cluded in this study correlated higher with speed than with quality of output, 
as might be expected. For example, the O'Connor Tweezer Dexterity Test 
yielded correlations of .46 and .07 with speed and quality criteria, respec- 
tively. The corresponding correlations obtained with the Minnesota Rate 
of Manipulation Test were .31 and .08. Thus in a factory that specializes in
high-quality merchandise, the motor dexterity tests might be poorer pre- 
dictors than in one concerned with rapid production for a mass market. 


MECHANICAL APTITUDE 


Mechanical aptitude tests cover a variety of functions. Motor factors enter 
into some of the tests in this category, either because the rapid manipulation 
of materials is required in the performance of the test, or because special 
subtests designed to measure motor dexterity are included. In terms of the 
factors discussed in the preceding chapter, perceptual and spatial aptitudes 
play an important part in many of these tests. Finally, mechanical reasoning 
and sheer mechanical information predominate in a number of mechanical
aptitude tests.

It is important to recognize the diversity of functions subsumed under the
heading of mechanical aptitude, since each function may be differently re-
lated to other variables. For example, mechanical information tests are much
more dependent upon past experience with mechanical objects than are ab-
stract spatial or perceptual tests. Similarly, sex differences may be reversed
from one of these functions to another. Thus in manual dexterity and in
perceptual discrimination tests, women generally excel; in abstract spatial
tests, a small but significant average difference in favor of males is usually
found; while in mechanical reasoning or information tests, men are markedly
superior, the difference increasing with age (cf. 1, Ch. 14).

In an attempt to clarify the nature of mechanical aptitude, Harrell (33) 
conducted a factor analysis of 31 tests that had been administered to a
group of 91 machine-fixers in a cotton mill. Apart from three verbal tests,
all were designed to measure some aspect of mechanical aptitude. In addition,
the factor analysis included scores on an interest test, ratings, age, amount of
schooling, and mechanical experience. Three principal factors were found to
have significant loadings in the mechanical aptitude tests. These were de-
scribed as perceptual, spatial, and agility or manual dexterity. It might be
noted that no mechanical information tests were administered in this study. 
Had such tests been included, one or more additional factors would probably 
have been identified. 




Fig. 81. Minnesota Spatial Relations Test. (Courtesy C. H. Stoelting Company.) 

We may now consider some examples of different types of tests designed 
to measure “mechanical aptitude.” Among the tests emphasizing abstract 
spatial and perceptual abilities are to be found formboards, construction
puzzles, and diagrammatic paper-and-pencil tests. The Minnesota Spatial Re-
lations Test (49), illustrated in Figure 81, falls into this category. This is one
of the tests standardized in an extensive investigation of mechanical aptitude
conducted at the University of Minnesota (47). It consists of four form-
boards, each containing 58 variously shaped cutouts. One set of blocks is
used with Boards A and B, another with Boards C and D. Both time and 
errors are scored. It might be noted that in the Harrell study (33) this test 
had its highest loading in the perceptual factor. 

Another test developed in the same Minnesota study is the Minnesota 
Paper Form Board. A later revision employing multiple-choice items was 
prepared by Likert and Quasha (38). This test is now available in two 
equivalent forms, obtainable in either hand-scored or machine-scored edi-
tions. Two sample items are reproduced in Figure 82. Each item in the test
consists of a figure cut into two or more parts. The subject determines how
the pieces would fit together into the complete figure, and chooses the draw-
ing that correctly shows this arrangement. An unusually large number of
studies have been conducted with the Minnesota Paper Form Board Test.
The results indicate that it is one of the most valid available instruments for
measuring the ability to visualize and manipulate objects in space (38).
Among the criteria employed in this research were performance in shop
courses, grades in engineering and in other technical and mechanical
courses, supervisors’ ratings, and objective production records. The test has
also shown some validity in predicting the achievement of dentistry and art
students.


For each item, the subject must choose the figure which would
result if the pieces in the first section were assembled.

Fig. 82. Sample Items from the Revised Minnesota Paper Form Board. (Reproduced
by permission of The Psychological Corporation.)

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Reference should likewise be made to the Spatial Relations Test of the 
DAT battery, discussed in Chapter 13. It will be recalled that each of the 
tests in this battery is printed in a separately obtainable booklet and has its 
Own norms. Such a test can therefore be used when only a measure of 
spatial aptitude is desired. Since it requires a somewhat more complex type 
of three-dimensional visualization, it may tap a different aspect of spatial 
ability than that covered by the Minnesota Paper Form Board. 

A test that undertakes to measure several aspects of mechanical aptitude 
is the MacQuarrie Test for Mechanical Ability (41). This test comprises the 
following seven subtests: Tracing, Tapping, Dotting, Copying, Location, 
Blocks, and Pursuit. The first three subtests were included as measures of 
speed and accuracy of eye-hand coordination. In the Harrell study, however, 
Dotting was the only subtest of the three having a high loading in the agility 
factor (.501). A sample item from this test is shown in Figure 83. The remaining 
four subtests were designed to measure spatial ability. According 
to Harrell's findings, all four have their highest loading in the spatial factor. 
Sample items from the three tests most heavily saturated with this factor are 
reproduced in Figure 83. 

It might be added that most of Harrell’s findings regarding the MacQuarrie 
test were corroborated in a subsequent factorial analysis of the subtest scores 
of 329 radio assembly operators (31). In this study, a spatial factor was the 
most prominent in the entire battery, although its highest loadings were in 
the Location, Copying, Blocks, and Pursuit tests. A controlled movement factor, 
probably corresponding to Harrell’s agility factor, showed significant 
weights in Tapping, Dotting, and Tracing. Slight evidence of a third factor, 
described as visual inspection, was found in the Tracing, Dotting, and Pursuit 
tests, all of which require careful observation of visual details. 

Norms are provided for total scores on the MacQuarrie test, as well as for 
each subtest. The use of specific subtest score patterns for different jobs is 
recommended in the manual. In this connection it might be noted that an 
early study of retest reliability yielded a coefficient of .90 for total scores 
and coefficients ranging from .72 to .86 for subtests (42). No data on 
reliability are cited in the manual, nor is any information provided regarding 
subtest intercorrelations. A number of validity studies employing various 
industrial criteria have been conducted with this test (41). A few of the 
reported validity coefficients for either individual subtests or combinations of 
subtests fall between .40 and .50; others are lower. 

Blocks: How many blocks touch each block with an X on it? 

Pursuit: Follow each line by eye and show where it ends, by writing its 
number in the correct box at the right. 

Fig. 83. Sample Items from the MacQuarrie Test for Mechanical Ability. 
(Reproduced by permission of California Test Bureau.) 

Like spatial aptitude tests, measures of mechanical reasoning and information 
can also be divided into performance and paper-and-pencil types. The 
former are “assembly tests,” requiring the subject to put together common 
mechanical objects from the given parts. One of the earliest tests of this type 
is the Stenquist Assembly Test (55), developed during World War I. A 
revision and extension of this test was standardized in the previously mentioned 
Minnesota study (47) and is now available as the Minnesota Mechanical 
Assembly Test (48). The three boxes containing the mechanical objects to 
be assembled are shown in Figure 84. A shorter form involving fewer objects 
is also available. Time limits are long enough to render the contribution of 
agility minimal. Although norms for other groups are available, these tests 
are especially suitable for high school boys. On such a group, the long form 
yielded an odd-even reliability of .94 and a correlation of .53 with a care- 
fully measured, comprehensive criterion of success in shop work (47). The 
practical usefulness of such assembly tests is limited by difficulties of ad- 
ministration, scoring, and maintenance, as well as by the impossibility of 
testing more than a few subjects simultaneously. 
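
The odd-even reliability cited above may be illustrated by a brief computational sketch. Each examinee's items are divided into odd-numbered and even-numbered halves, the two half-scores are correlated, and the result is stepped up to full test length. The text does not say whether the reported coefficient of .94 was so corrected; the use of the Spearman-Brown formula here is an assumption, and the item scores and function names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def odd_even_reliability(item_scores):
    """item_scores holds one list of item scores per examinee.

    Returns the half-test correlation and the Spearman-Brown
    estimate for the full-length test (an assumed correction).
    """
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full

# Hypothetical scores (1 = credit, 0 = no credit) for five examinees on six items.
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
r_half, r_full = odd_even_reliability(scores)
print(round(r_half, 2), round(r_full, 2))
```

The stepped-up value exceeds the half-test correlation because each half is only half as long as the test actually administered.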


Fig. 84. Minnesota Mechanical Assembly Test. (Courtesy C. H. Stoelting Company.) 


To meet these practical problems, pictorial tests of mechanical compre- 
hension have been developed. Such tests were also initiated by Stenquist with 
an early test requiring the subject to match pictures of mechanical objects 
or of parts of objects that belong together. A more recent application of the 
same approach is found in the Mellenbruch Mechanical Motivation Test 
(43). The rationale underlying this test is that individuals who are mechani- 
cally adept and interested in tools and machinery are more likely to have 
acquired the information required on the test. To what extent scores on this 
test reflect interest or motivation, mechanical aptitude, or sheer amount of 
mechanical experience cannot be determined from the available data. In 
view of the paucity of information on this test, it must be regarded as being 
still in an experimental stage. 

A combination of pictures and questions, which permits wider coverage of 
content and more emphasis on the understanding of mechanical principles, 
is found in the Bennett Test of Mechanical Comprehension (5). This test 
has been widely used for both military and civilian purposes. The currently 
available civilian forms include: Form AA, suitable for boys in high school 
or trade school, for unselected adult men, and for certain industrial groups; 
When the two numbers or names in a pair are exactly the same, 
make a check mark on the line between them. 


66273894 66273984 


527384578 527384578 


New York World New York World 


Cargill Grain Co. Cargil Grain Co. 


Fig. 86. Sample Items from the Minnesota Clerical Test. (Reproduced by permission 
of The Psychological Corporation.) 
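
The checking task shown in Figure 86 amounts to an exact, character-by-character comparison of the two members of each pair. A minimal sketch of that comparison, using the four sample pairs from the figure, is given below. The function name is hypothetical, and no claim is made about how the published test is timed or scored.

```python
def same_pair(left, right):
    """True only when the two entries match character for character."""
    return left == right

# The four sample pairs reproduced in Figure 86.
pairs = [
    ("66273894", "66273984"),
    ("527384578", "527384578"),
    ("New York World", "New York World"),
    ("Cargill Grain Co.", "Cargil Grain Co."),
]

checks = [same_pair(left, right) for left, right in pairs]
for (left, right), is_same in zip(pairs, checks):
    mark = "check" if is_same else "     "
    print(f"{left:<20}{mark:^7}{right}")
```

Only the second and third pairs receive a check mark; the first differs by a transposition of two digits and the fourth by a single omitted letter.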


Data on both concurrent and predictive validity of the Minnesota Clerical 
Test are provided by a number of scattered studies (cf. 2, 16). Moderately 
high correlations have been found between scores on this test and ratings by 
office supervisors or by commercial teachers, as well as performance records 
in courses and in various kinds of clerical jobs. Several studies employed the 
method of contrasted groups. Thus comparisons are reported between different 
levels of clerks, between clerical workers and persons engaged in other 
occupations, and between employed and unemployed clerks. All these 
comparisons yielded significant differences in mean scores in the expected 
direction. A marked and consistent sex difference in favor of women has been 
found on this test, beginning in childhood and continuing into adulthood. 

It is apparent, of course, that such a relatively homogeneous test as the 
Minnesota Clerical Test measures only one aspect of clerical work. Clerical 
jobs cover a multiplicity of functions. Moreover, the number and particular 
combination of duties vary tremendously with the type and level of job. 
Even specific jobs designated by the same name, such as typist, filing clerk, or 
shipping clerk, may differ considerably from one company to another, owing 
to the size of the company, degree of possible specialization of jobs, nature of 
the work, and other local conditions. Despite such a diversity of activities, 
however, job analyses of general clerical work indicate that a relatively 
large proportion of time is spent in such tasks as classifying, sorting, checking, 
collating and stapling, stuffing and sealing envelopes, and the like (cf. 8). 
Speed and accuracy in perceiving details, together with a certain minimum 
of manual dexterity, would thus seem to be of primary importance for the 
clerical worker. 
To be sure, many other types of jobs require perceptual speed and accuracy. 
Inspectors, checkers, packers, and a host of other factory workers 
obviously need this ability. It is interesting to note in this connection that 
the Minnesota Clerical Test has also been found to have some validity in 
predicting the performance of such workers (cf. 16). However, it is likely 
that higher validity may be obtained in such cases by designing a similar 
test with pictorial rather than with verbal or numerical material. It will be 
recalled that, in the GATB prepared by the United States Employment Service, 
separate tests are included for the Q and P factors. The Q factor appeared 
in number- and word-checking tests similar to those that make up the 
Minnesota Clerical Test. The P factor, on the other hand, occurred in tests 
requiring the perception of similarities and differences in spatial items, and is 
probably more closely related to the inspection of materials. 

Several tests of clerical aptitude combine perceptual speed and accuracy 
with other functions required for clerical work. Among the measures used for 
the latter functions are "job sample" types of tests for such activities as 
alphabetizing, classifying, coding, and the like. In addition, some measure of 
verbal and numerical ability may be included, to serve in lieu of a “general 
intelligence” test. An example of such a composite test of clerical aptitude 
is the Psychological Corporation General Clerical Test (52). This test 
consists of nine subtests, grouped so as to yield clerical, numerical, and 
verbal scores, as well as a total score. The first two tests, Checking and 
Alphabetizing, are designed to measure speed and accuracy in routine clerical 
jobs. The Alphabetizing test, which has considerable face validity for clerical 
workers, is illustrated in Figure 87. The numerical score is derived from 
tests of Arithmetic Computation, Error Location, and Arithmetic Reasoning. 
Four tests are combined to yield the verbal score, namely, Spelling, Reading 
Comprehension, Vocabulary, and Grammar. The entire battery requires 
approximately fifty minutes. 

Instructions: After each name, write the number of the drawer in which that 
record should be filed. Work quickly and accurately. The first two are marked 
correctly. 

Fig. 87. Alphabetizing Subtest from the General Clerical Test, Showing Two 
Practice Items. (Reproduced by permission of The Psychological Corporation.) 

Other clerical tests covering a combination of aptitudes include the Turse 
Clerical Aptitudes Test (62), SRA Clerical Aptitudes Test (53), Short 
Employment Tests (9), and Purdue Clerical Adaptability Test (37). Mention 
should also be made of the Clerical Speed and Accuracy test of the DAT 
(Ch. 13). In selecting clerical workers, this test could be used singly or in 
combination with the two Language Usage tests and possibly some of the 
other parts of the battery. 


general intelligence and those requiring perception of details have good 
validity in this situation, but motor tests do not. A factor 
analysis (3) of 17 subtests taken from commonly used clerical aptitude tests 
yielded three factors, identified as perceptual analysis, speed in making 
simple discriminations, and comprehension of relations (primarily verbal, but 
also appearing in some numerical tests). It is interesting to note that the 
Minnesota Clerical Test had substantial loadings in all three of these factors. 

A few aptitude tests for typing and shorthand have also been developed, 
to predict a student’s performance in learning these skills. Such tests are 
designed for use prior to training and are thus to be distinguished from 
proficiency tests in typing or shorthand, to be discussed under achievement tests 
(Ch. 17). Examples of aptitude tests in this area include the Turse Shorthand 
Aptitude Test (61), E. R. C. Stenographic Aptitude Test (14), and 

of this test appear promising. 

CHAPTER 15 


Special Aptitude Tests: II 


Continuing the survey of special aptitude tests begun in Chapter 14, this 
chapter is concerned with tests in the areas of art, music, literary apprecia- 
tion, and creativity. Tests in the general field of reasoning, inventiveness, 
originality, and creativeness have constituted an important and growing focus 
of research interest since the end of World War II. Much of this activity 
stems from a concern with the identification and training of high-level 
scientists. Although oriented principally toward scientific productivity, such 
research also sheds some light on artistic creativity. 

The development of tests specifically designed for measuring aesthetic 
abilities, on the other hand, has been slow and sporadic. Little progress in the 
testing of artistic, musical, or literary aptitudes has been made since the early 
1940's. In number, scope, and technical refinements, tests in this area have 
lagged far behind other aptitude tests. In part, this condition may result from 
the resistance artistically trained persons have exhibited toward objective 
measurement, quantification, and the “scientific” approach to artistic talent. 
Traditionally, art and science have been 

psychological testing with suspicion or skepticism 

-constructed tests of artistic aptitudes 

systems of our contemporary culture. The greatest effort will, in general, be 
exerted in constructing those tests that meet the most urgent social needs. 
To a large extent, the development of tests reflects the demand for such 
instruments. In our culture, the demand for testing office clerks, engineers, 
or Air Force pilots has proved more widespread and more insistent than the 
demand for testing poets, musicians, or painters. 

Aesthetic aptitudes represent a broad and varied category of traits. It is 
apparent, of course, that a different constellation of talents is required for 
music, the graphic arts, and literature. Further specialization within each of 
these fields is also undoubtedly associated with a diversity of personal 
qualifications. The poet and the biographer, the traditional portraitist and the 
surrealist painter, the coloratura soprano in grand opera and the saxophonist 
in a dance orchestra all must obviously meet a very different set of 
requirements. 

Within the same form of art, moreover, the individual may play any one 
of a variety of roles, each with its own characteristic set of qualifications. 
Thus the creative artist, art teacher, critic, collector, dealer, and museum 
worker approach art with a different pattern of skills, knowledge, interests, 
attitudes, feelings, and motives. In music, similar differences can be 
recognized between the composer, performer, critic, teacher, and appreciative 
public, not to mention the obvious differences associated with type of 
instrument, or with popular versus classical music. The field of literature can 
likewise be subdivided in a corresponding fashion. It is evident that a thorough 
coverage of artistic aptitudes would require many different "job 
specifications," some of which might have little in common. 

Turning now to available tests for the measurement of aesthetic aptitudes, 
we find that such tests fall principally into three major classes, namely, art 
(in the narrow sense of graphic arts), music, and literature. Each will be 
considered in a separate section. Within each of these categories, a further 
distinction can usually be made between tests of appreciation and tests of 
production. 

It is obvious, of course, that appreciation does not require productive 
skills. A person may be a highly discriminating and sophisticated connoisseur 
of paintings without himself being able to paint. But artistic production, 
except at a routine and mechanical level, undoubtedly presupposes superiority 
in both appreciative and productive skills. Thus tests of art appreciation have 
a broader applicability than tests of production. Moreover, productive skills 
are more closely dependent upon specific training. The measurement of such 
skills is therefore more likely to fall under the heading of achievement tests. 
As we consider specific tests it will become apparent, however, that in the 
measurement of artistic talents the distinction between aptitude and 
achievement tests is especially tenuous. 


ARTISTIC APTITUDES 


Tests of artistic appreciation have generally followed a common pattern. 
In each item, the subject is requested to express his preference regarding two 
or more variants of the same object. One variant is either an original by an 
eminent artist or a version preferred by the majority of a group of art experts. 
The other versions represent deliberate distortions designed to violate some 
accepted principle of art. Any controversial items, on which a clear 
consensus of experts cannot be obtained, are normally eliminated from such a 
test. 

In the development of art appreciation tests, both original item selection 
and subsequent validation procedures depend heavily upon the opinions of 
contemporary art experts within our culture. It is well to bear this fact in 
mind when interpreting scores. Essentially, such tests indicate the degree to 
which the individual's aesthetic "taste" agrees with that of contemporary art 
experts. The representativeness of the particular group of experts employed 
in developing the test is obviously an important consideration. In so far as 
aesthetic standards or “taste” may change with time, the periodic rechecking 
of scoring key and item validities is likewise desirable. 

The McAdory Art Test (41, 54) represents one of the earliest attempts 
to measure artistic appreciation. First published in 1929, this test is now of 
historical interest only, because many of its items are outdated. The test 
employed contemporary materials taken from art and trade magazines, as well 
as art objects chosen from museums and art books. The items cover such 
varied categories as furniture and household utensils, textiles and clothing, 
automobiles, and painting and other graphic arts. Each item consists of four 
variations, to be ranked in order of preference by the subject. The four 
versions differ in shape and line arrangement, massing of dark and light, or 
color. 

The Meier Art Judgment Test (45), a revision of the earlier Meier-Seashore 
Test (47), is undoubtedly the most widely used test of artistic appreciation. 
This test, whose first edition also appeared in 1929, was revised in 1940. 
The revision consisted essentially in the elimination of the 25 items having 
the lowest correlations with total score and, within the remaining 100 items, 
the allotment of double credit to the 25 having the highest correlations with 
total score. 
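
The revision procedure just described, in which the items correlating lowest with total score are dropped and the items correlating highest are given double credit, can be sketched as follows. This is an illustration under stated assumptions and not the author's actual computations: the function name, the item labels, and the correlations are all hypothetical, and a small pool of 8 items stands in for the full item set.

```python
def revise_scoring_key(item_total_r, n_drop, n_double):
    """Build a revised scoring key from item-total correlations.

    item_total_r maps each item to its correlation with total score.
    The n_drop items with the lowest correlations are eliminated; among
    the retained items, the n_double with the highest correlations
    receive double credit and the rest single credit.
    """
    ranked = sorted(item_total_r, key=item_total_r.get)  # lowest first
    retained = ranked[n_drop:]
    doubled = set(retained[-n_double:]) if n_double else set()
    return {item: (2 if item in doubled else 1) for item in retained}

# Hypothetical item-total correlations for a pool of 8 trial items.
correlations = {
    "a": 0.10, "b": 0.45, "c": 0.05, "d": 0.30,
    "e": 0.60, "f": 0.25, "g": 0.52, "h": 0.38,
}
key = revise_scoring_key(correlations, n_drop=2, n_double=2)
max_score = sum(key.values())
print(key, max_score)
```

With the figures given in the text, n_drop=25 and n_double=25, a retained pool of 100 items would yield a maximum score of 75 + 2 × 25 = 125 points.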
Fig. 88. Item Similar to Those Employed in the Meier Art Judgment Test. This 
item does not appear in the current revised form. The difference between the 
two versions is in the angle at which the window seat crosses the picture. 
(Courtesy Norman C. Meier.) 


For more than two decades, Meier and his associates at the University of 
Iowa conducted research on the nature of artistic talent. Although most of 
their subjects were children, data on adult groups, including professional 
artists, were also obtained. This research led Meier to the conclusion that 
artistic aptitude comprises six interlinked traits, viz., manual skill, volitional 
perseveration, aesthetic intelligence, perceptual facility, creative imagination, 
and aesthetic judgment. The Meier Art Judgment Test is designed to measure 
only the last of these six traits. The original plans called for the preparation 
of creative imagination and aesthetic perception tests, but little progress 
was made on the development of these tests. It might be added that the six 
traits listed were not identified by factor analysis, but are based upon the 
author's interpretation of a mass of relevant observations. Corroboration by 
means of factor analysis would be desirable. More objective evidence on the 
relation between aesthetic judgment and the quality of artistic production 
should likewise be provided. 

Percentile norms are given for three groups: 1445 junior high school 
students; 892 senior high school students; and 982 adults, including college and 
art school students. All norms were derived largely from students in art 
courses, whether in high school, college, or special art schools. Data were 
gathered in 25 schools scattered throughout the United States. Split-half 
reliability coefficients between .70 and .84 are reported for relatively homo- 
geneous samples. 

Most of the available evidence regarding validity of the Meier Art Judg- 
ment Test can be summarized under the headings of item selection and con- 
trasted group performance, although a few correlations with independent 
criteria of artistic accomplishment have also been reported. First, it should 
be noted that the original items were chosen from reputable art works and 
that the distortions were such as to violate accepted art principles. A large 
number of items thus assembled were then submitted to 25 art experts, and 
only those items showing clear-cut agreement among the experts were re- 
tained. Finally, items were selected on which from 60 to 90 per cent of a 
group of 1081 miscellaneous subjects chose the original as the preferred ver- 
sion. In the revised edition, it will be recalled, items were further selected on 
the basis of internal consistency. 
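
The percentage criterion used in this stage of item selection can be expressed as a simple filter: an item is kept only if the proportion of subjects preferring the original falls between 60 and 90 per cent. In the sketch below the item names and counts are hypothetical, and treating both limits as inclusive is an assumption, since the text does not state how borderline items were handled.

```python
def select_items(choice_counts, n_subjects, low=0.60, high=0.90):
    """Keep items on which the proportion choosing the original lies in [low, high].

    choice_counts maps each item to the number of subjects who preferred
    the original version; n_subjects is the size of the tryout group.
    """
    return [
        item
        for item, count in choice_counts.items()
        if low <= count / n_subjects <= high
    ]

# Hypothetical counts for a tryout group of 1081 subjects.
counts = {
    "item_1": 1030,  # about 95 per cent: too easy, rejected
    "item_2": 800,   # about 74 per cent: retained
    "item_3": 540,   # about 50 per cent: no clear preference, rejected
    "item_4": 650,   # about 60 per cent: retained
}
selected = select_items(counts, n_subjects=1081)
print(selected)
```

Items chosen correctly by nearly everyone, or by no more than about half the group, contribute little to discrimination among subjects, which is presumably why such limits were set.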

Total scores on the Meier Art Judgment Test exhibit a sharp differentia- 
tion in terms of age, grade, and art training. Thus art faculty score higher 
than non-art faculty, art students higher than comparable non-art students. 
The extent to which these group differences result from selection or from 
previous art training cannot be determined from available data. Although no 
validity coefficients are given in the manual, a few are reported in other pub- 
lished sources. Correlations ranging from .40 to .69 have been found be- 
tween scores on this test and art grades or ratings of creative artistic ability 
(8, 34, 48). 

Like most artistic aptitude tests, the Meier Art Judgment Test has regularly 
shown negligible correlation with traditional intelligence tests, such as 
the Stanford-Binet or group verbal tests. This does not mean, however, that 
abstract intelligence or scholastic aptitude is unrelated to ultimate success in 
an art career. In fact, there is some evidence to indicate that, for higher 
levels of artistic accomplishment, superior scholastic aptitude is a decided 
asset. In one of the investigations conducted at Iowa, for example, the mean 
IQ of successful artists was found to be 119 (57). Similarly, a group of 
artistically gifted children studied at the University of Iowa had IQ's ranging 
from 111 to 166 (46). 

While the McAdory Test employed many contemporary dated items, and 
the Meier Test was constructed from relatively timeless art products, the 
more recently developed Graves Design Judgment Test (22) consists 
exclusively of abstract designs. Non-representational figures were chosen in 
order to evoke a purely aesthetic response, "unencumbered by association" 
with specific objects. In the development of this test about 150 items were 
prepared, each consisting of two or three comparable designs. In each item, 
one design was organized in accordance with certain aesthetic principles, 
including “unity, dominance, variety, balance, continuity, symmetry, propor- 
tion, and rhythm." The other design or designs violated one or more of these 
principles. The preliminary form of the test was administered to art teachers, 
art students, and non-art students, with instructions to indicate the pre- 
ferred design in each set. Items were retained on the basis of: (a) agreement 
among art teachers regarding the preferred design; (b) more frequent se- 
lection of the “better” design by art students than by non-art students; and 
(c) internal consistency, i.e., "better" design chosen more often by those 
obtaining high scores than by those obtaining low scores on the entire test. 
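
The three retention criteria listed above lend themselves to a short sketch in which each trial item is checked against (a), (b), and (c) in turn. The cutoff for agreement among art teachers, the field names, and the proportions are hypothetical; the text reports the criteria but not the numerical standards applied.

```python
from dataclasses import dataclass

@dataclass
class ItemStats:
    teacher_agreement: float  # proportion of art teachers preferring the keyed design
    p_art: float              # proportion of art students choosing the keyed design
    p_non_art: float          # proportion of non-art students choosing it
    p_high: float             # proportion among high scorers on the entire test
    p_low: float              # proportion among low scorers on the entire test

def meets_criteria(stats, min_teacher_agreement=0.75):
    """Apply criteria (a), (b), and (c); the 0.75 cutoff is an assumed value."""
    agreement = stats.teacher_agreement >= min_teacher_agreement  # (a)
    group_difference = stats.p_art > stats.p_non_art              # (b)
    internal_consistency = stats.p_high > stats.p_low             # (c)
    return agreement and group_difference and internal_consistency

# Hypothetical statistics for four trial items.
trial_items = {
    "item_1": ItemStats(0.90, 0.80, 0.60, 0.85, 0.55),
    "item_2": ItemStats(0.60, 0.75, 0.50, 0.80, 0.60),  # fails (a)
    "item_3": ItemStats(0.85, 0.55, 0.58, 0.70, 0.50),  # fails (b)
    "item_4": ItemStats(0.95, 0.70, 0.65, 0.60, 0.62),  # fails (c)
}
retained = [name for name, stats in trial_items.items() if meets_criteria(stats)]
print(retained)
```

An item must satisfy all three conditions to be retained, so a design on which experts agree may still be dropped if it fails to separate art from non-art students or high from low scorers.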


Subject indicates which design he prefers. 


Fig. 89. Item from the Graves Design Judgment Test. (Reproduced by permission of 
The Psychological Corporation.) 

The final test consists of 90 items, 8 containing three designs each, the rest 
containing only two. The designs are executed in black, white, and gray. 
Some are line drawings; others are composed of squares, circles, triangles, 
and similar two-dimensional figures; still others look like reproductions of 
three-dimensional abstract sculptures. A sample item is reproduced in Figure 
89. Percentile norms are given for several art and non-art student groups at 
the high school and college level, all tested in New York State. Split-half 
reliability coefficients in fairly homogeneous groups ranged from .81 to .93, 
with a median of .86. Validity data are meager, being based chiefly on 
significant differences in mean scores between contrasted criterion groups. 

Most tests of creative artistic ability are actually worksamples. As such, 
they are undoubtedly influenced to a large extent by formal art training and 
could be regarded as achievement tests. A number of such tests, however, 
have been designed specially for use in predicting performance in subsequent 
training and may therefore be included in the present category. Among 
the best known are the Lewerenz Tests in Fundamental Abilities of Visual Art 
(38), the Knauber Art Ability Test (35, 36), and the Horn Art Aptitude 
Inventory (31). The Lewerenz was designed for grades 3 to 12, the 
Knauber for grades 7 to 16, and the Horn for grades 12 to 16 and adults. The 
first two include a variety of subtests, covering art appreciation and 
information, as well as drawing skills and originality. Both are early tests that have 
not been revised. Insufficient data are reported to permit an adequate 
evaluation of their effectiveness. In general, however, they appear to be 
crude when judged in terms of present test construction standards. 

The Horn Art Aptitude Inventory was more recently developed and has 
undergone a certain amount of revision. Concentrating on the measurement 
of creative artistic abilities, this test has a fairly high ceiling and shows 
adequate discrimination among applicants for admission to art schools. The 
test includes the following three parts: 


1. Scribble Exercise: The subject is directed to make outline drawings of 20 
simple objects, such as book or fork, within time limits of two to six seconds 
for each drawing. This test is designed partly to give the subject confidence 
and partly to determine quality of line, appreciation of proportion, and skill 
in composition or arrangement of object on the page. 

2. Doodle Exercise: The subject is required to draw simple abstract 
compositions with given figures, such as six triangles, a rectangle divided by two 
lines, and the like. This test bears a certain resemblance to the Graves Design 
Judgment Test, although the subject now produces his own designs instead 
of judging given designs. 

3. Imagery: This test provides 12 rectangles, in each of which a few lines have 
been printed to act as "springboards" for artistic compositions. The subject 
sketches a picture in each rectangle, building upon the given lines. In Figure 
90 will be found one of the given rectangles, together with two different 
drawings made from the same initial lines. 


Fig. 90. Sample Item from the “Imagery” Test of the Horn Art Aptitude Inventory. 
The first rectangle shows the stimulus lines; the other two contain drawings made with 
these stimulus lines. In the second drawing, the card has been turned to a horizontal 
position. (From Horn and Smith, 32, p. 351; reproduced by permission of American 
Psychological Association.) 

The Horn Art Aptitude Inventory is scored by means of the product scale 
technique. Samples of excellent, average, and poor work are furnished as a 
basis for rating the subject's drawings. As an additional scoring guide, the 
manual lists certain factors to be considered, such as order, clarity of 
thought and presentation, quality of line, use of shading, fertility of 
imagination, and scope of interests. Although the scoring still leaves much to 
subjective judgment, correlations of .79 to .86 are reported between the results 
obtained by different scorers. 

Some indication of validity is provided by two preliminary studies con- 
ducted with the Horn test. Within a group of 52 art school graduates, a cor- 
relation of .53 was found between test scores and mean instructors' ratings of 
performance in a three-year art course. The second study was conducted 
with 36 high school seniors enrolled in a special art course. In this group, the 
test scores obtained at the beginning of the year correlated .66 with mean 
instructors’ ratings at the end of the course. A negligible correlation between 
performance on the Horn test and intelligence test scores was found in the 
previously mentioned group of 52 art school graduates. 

As a measure of the more complex and creative aspects of artistic aptitude 
at a relatively high level, this test appears to have promise. It has, however, 
been criticized on the grounds that the scoring puts a premium on conformity 
to tradition in technique and composition, while great artists often deviate 
from the norm in these respects. Such a criticism could probably be directed 
against all current art aptitude tests. Whether a test can be devised to measure 
the degree of originality characteristic of truly great art remains to be 
seen. In the meantime, many aspects of artistic aptitude can be measured. 
The skills involved may represent necessary though not sufficient conditions 
for artistic production. 

A more specific limitation of the Horn Art Aptitude Inventory is that it 
calls for a certain minimum of artistic training or experience. It could 
obviously be used as either an achievement or aptitude test. As a further 
illustration of how tenuous such distinctions are, it might be noted that 
this test has been employed by one investigator as a projective technique 
in the diagnosis of personality characteristics (28). 


Regarding the present status of art aptitude tests as [...]
specific art courses and measures of subsequent achievement. Some available
evidence suggests that terminal grades in art courses correlate higher with
worksample tests, such as the Lewerenz and the Knauber, than with art
appreciation tests, such as the McAdory and the Meier (3). These correlations,
however, may simply reflect the influence of earlier art training in other
courses.

Another fruitful approach to the further development of art tests is through
the factorial analysis of artistic aptitudes. Most, if not all, existing art
aptitude tests are based upon certain assumptions regarding the essential
factors in artistic aptitude. An objective verification of these assumptions
would be desirable.


It would also be of interest to investigate further the effect of cultural
differences upon performance on the various art tests. Some scattered data
suggest that these tests are restricted in their applicability to specific
cultures. Certainly what is known about cultural differences in artistic
expression and artistic standards would support such a view. An investigation
of approximately 300 Navajo Indian children with the McAdory test found the
Indians to fall far below the norms of New York City whites, despite the high
degree of artistic development that characterizes the Navajo Indian culture
(51). In view of the nature of the McAdory Test this finding is hardly
surprising. The application of the Meier Art Judgment Test to artists, art
students, college students, and other adult groups in Brazil likewise indicated
the need for certain revisions if the test were to be used in that culture (14).


MUSICAL APTITUDES 


During the first four decades of the present century extensive research on
the psychology of music was conducted at the University of Iowa under the
direction of Carl E. Seashore (51). One of the outcomes of these investigations
was the preparation of the Seashore Measures of Musical Talents (50, 52). In
its current form, this series consists of six tests covering pitch, loudness,
rhythm, time, timbre, and tonal memory. Like most musical aptitude tests, the
Seashore tests are reproduced on phonograph records for group administration
and uniformity of presentation.

Each item in the Seashore tests consists of a pair of tones or tonal
sequences. In the pitch test, the subject indicates whether the second tone is
higher or lower than the first. The items are made progressively more difficult
by decreasing the difference in pitch between the two tones in each pair. In the
loudness test, the subject determines whether the second tone is stronger or
weaker than the first. The rhythm test requires the comparison of rhythmic
patterns that are either the same or different within each pair. In the time
test, the subject records whether the second tone in each pair is longer or
shorter than the first. The timbre test calls for the discrimination of tone
quality, the two tones in each pair being either the same or different in this
respect. In the tonal memory test, short series of three to five tones are
played twice in immediate succession. During the second playing, one note
is changed, and the subject must write the number of the altered note, i.e.,
first, second, etc.

The Seashore tests are applicable from the fourth grade to the adult level.
The testing of younger children by this procedure has not proved feasible
because of the difficulty of sustaining interest and attention. Even above the
age of 10 the scores on these tests may be lowered by inattention.
Consequently, the tests are not as reliable at these ages as they are for older
subjects. The scores are not combined into a single total, but are evaluated
separately in terms of percentile norms. These norms are reported for grades
4 to 5, 6 to 8, and 9 to 16, the normative samples for each test and grade level
containing from 377 to 4319 cases. Age changes are slight and sex differences
are negligible. The tests are probably somewhat susceptible to practice
and training, although studies of these effects have yielded conflicting
results (18, 51).

Kuder-Richardson reliability coefficients of the six Seashore tests range
from .55 to .85 within the three normative grade-level groups. Only content
validity is discussed in the manual, Seashore having argued over the years
that this is the most appropriate type of validity for such tests. It is
undoubtedly true that ability to discriminate pitch, loudness, timbre, and other
properties of tones is essential to both appreciation and production of music.
To predict musical achievement, however, we need to know much more.
What is the minimum cutoff point for different kinds of musical activities?
Is there any correlation between [...] measured by the tests, both in relation
to each other and in relation to the entire array of requisite traits? A few
scattered studies provide meager evidence of predictive validity against
various criteria of performance in music training (5, 40, 52). Many of these
validity coefficients are low, few reaching .30 or .40. Apart from the
unreliability of criterion measures and the complexity of factors affecting
musical achievement, it should be noted [...] intelligence tests are
negligible, as would be expected for special aptitude tests. Intercorrelations
among the six tests are higher than had been anticipated (18). The functions
measured by the different tests are thus less independent than had originally
been supposed, a finding that has also been confirmed by factor analysis (42).
It should also be noted that the Seashore tests or adaptations of them have
proved helpful in predicting achievement in certain civilian and military
specialties requiring auditory discrimination, such as those of sonar operator
and radiotelegrapher (21, 58).
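
The Kuder-Richardson coefficients cited above can be made concrete with a
small computation. The sketch below assumes the common Kuder-Richardson
formula 20 applied to items scored simply right (1) or wrong (0); the text
does not say which Kuder-Richardson formula the Seashore manual used, and the
function name and the sample data are hypothetical, chosen only for
illustration.

```python
from statistics import pvariance


def kr20(item_scores):
    """Kuder-Richardson formula 20 for dichotomous (0/1) items.

    item_scores: one list per examinee, each holding that examinee's
    0/1 score on every item. Illustrative helper, not taken from any manual.
    """
    n_people = len(item_scores)
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")

    # Variance of the total scores across examinees (population form).
    totals = [sum(row) for row in item_scores]
    total_var = pvariance(totals)
    if total_var == 0:
        raise ValueError("total scores do not vary; KR-20 is undefined")

    # Sum of p*q over items, where p is the proportion passing the item.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_scores) / n_people
        pq_sum += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - pq_sum / total_var)


# Hypothetical data: five examinees, four items of increasing difficulty.
sample = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(sample), 2))
```

A coefficient computed in this way rises as the items hang together more
closely and as the test is lengthened, which is one reason short subtests tend
to show the lower values in a range such as .55 to .85.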


[...] has been used widely by music teachers in [...] classes is the
Kwalwasser-Dykema Music Tests [...] aspects of musical appreciation [...]
10 tests and for total [...] grade interval from [...]
the Kwalwasser-Dykema manual is of little or no help. No mention is made
of reliability or validity. Other investigators, however, have found that the
reliabilities of some of these tests are so low as to render their scores
practically worthless (5, 18). Most of the tests are too short and contain too
few discriminative items, especially in the middle of the difficulty range. The
popularity of this battery probably stems from the fact that it seems to yield
so much information in so little time. But the information may be incorrect.

An attempt to improve the discriminative value and reliability of the
Kwalwasser-Dykema battery without changing the content is reported by
Holmes (30). The range of scores on each test was greatly increased by
requiring more differentiation in the responses and assigning partial credits
in scoring. For example, instead of merely indicating whether two notes are the
same or different in pitch, the subject reports whether the second note is
equal to the first, different, different and higher, or different and lower. By
these procedures, the reliability of the entire battery was raised to .91.
Subtest reliabilities, although consistently higher than in the old version,
ranged from .43 to .88.

A more recently developed battery that concentrates on only two fundamental
components of musical aptitude is the Drake Musical Aptitude Tests (15). These
tests are designed for use at ages 8 and over. One part measures musical
memory by presenting a two-bar melody which the subject must compare from
memory with other versions. If the version is unchanged, the subject indicates
so. If it is altered, he must state whether the change was in the key, the
time, or the notes. Preliminary illustrations are used to familiarize the
subjects with the meaning of these musical terms. The other part is a rhythm
test designed to measure the subject's ability to keep time. This test does
not appear to measure the same ability as the rhythm test in the Seashore
series and correlates low with that test. The memory and rhythm tests of
the Drake battery also have low correlations with each other. Both are
reported to have high reliabilities, in the .80's and .90's, and unusually high
predictive validities against a general criterion of subsequent achievement in
music training. These tests are promising and merit further validation studies
by other investigators.

A somewhat more comprehensive battery is the Wing Standardized Tests of
Musical Intelligence (62), developed in England. These tests too may be
used from age 8 on. Like the Drake tests, the Wing tests depart from the
“atomistic” sensory approach of the Seashore tests and make use of musically
meaningful content. Piano music is utilized in each of the seven tests,
which cover chord analysis, pitch change, memory, rhythmic accent, harmony,
intensity, and phrasing. The first three tests require sensory discriminations,
but at a somewhat more complex level than in the Seashore tests. In
the other four, the subject compares the aesthetic merit of two versions.
Thus the battery places considerable emphasis on music appreciation.

Norms are provided for total scores on the entire Wing battery. Such a
use of total scores is supported by the identification of a general factor of
musical ability in factorial analyses of music tests (42, 61). This factor,
described as the cognitive aspect of musical ability, accounted for 30 to 40
per cent of the total test variances. For older children and adults, both
retest and split-half reliabilities of total scores on the Wing battery are in
the .90's. Preliminary studies of validity in small groups yielded correlations
of .60 or higher with teachers’ ratings of musical ability. The tests have high
ceilings and may find their greatest usefulness in the selection of musically
talented individuals for further training.

Mention may also be made of various attempts to measure musical attitudes
and interests by means of questionnaires or self-ratings. An example is
the Seashore-Hevner Tests for Attitude toward Music (53), patterned after
the Thurstone scales to be discussed in Chapter 19. More recently, Farnsworth
(19) developed a similar series of scales for rating interest in different
types of music. In so far as emotional reactions play a significant part in
both the appreciation and production of music, the measurement of musical
interests would seem to be as important as the measurement of aptitudes in
this area.


LITERARY APPRECIATION 


Literary aptitude, like aptitude in music or in the graphic arts, involves
a multiplicity of skills. Among the measuring instruments employed in this
area are to be found tests of literary information; tests concerned with
grammar, spelling, word knowledge, and other mechanics of writing; product
scales; and tests of literary appreciation. The first two types are
predominantly achievement tests which are used to measure the effects of
specific [...] then graded by “matching” each with a sample product in a
previously prepared scale. The raters are usually given a list of specific
points to check, such as grammar, spelling, word choice, organization, clarity,
and originality.

In so far as creative qualities and other aesthetic characteristics of the
writing are judged, such product scales would fall within the scope of the
present chapter. In most of these tests, however, the major emphasis is placed
upon the individual's ability to use language as a means of factual
communication, rather than as an aesthetic medium. Moreover, such tests are
employed principally to appraise the educational progress of school children or
college students, rather than to select individuals with special literary
talents. Thus this type of test, as well as the literary and grammatical
information tests, fits more properly into the content of Chapters 16 and 17,
which are concerned with achievement tests.

Tests designed to measure literary appreciation, on the other hand, have
much in common with the tests of music and art appreciation discussed in
the preceding sections of this chapter. To be sure, some of the tests labeled as
measures of literary appreciation are in effect tests of comprehension or
information. With these instruments we shall not be concerned at present. A
number of attempts have been made, however, to apply to the literary
field the same basic technique used by Meier, Graves, Wing, and others in
the development of art and music appreciation tests.

An early test developed in 1921 by Abbott and Trabue (1) established
the pattern for this approach to the measurement of literary appreciation.
Each item in this test consisted of a short poem or part of a poem, presented
in the original and three distorted versions. The three distortions included a
sentimental version, in which emotion was falsified by introducing silly,
gushy, affected, or otherwise insincere feelings; a prosaic version, in which
imagery was reduced to a more pedestrian and commonplace level; and a
metrical version, in which movement was rendered awkward or less fine
than in the original. In taking the test, subjects were instructed merely to
mark the selection they liked best in each set.

The Rigg Poetry Judgment Test (49), published more than twenty years
later, utilized essentially the same approach as the Abbott-Trabue. In the
Rigg test, only two versions are given in each item, and the number of items
has been increased. An example is shown in Figure 91. This test, which is
available in two parallel forms, provides norms for high school, college,
and adult “expert” groups, although it appears to be too difficult for the
high school level. Parallel-form reliabilities in the .70's are reported for
high school and college groups. When scores from both forms are combined, the
reliabilities rise to the .80's. Validity was “built in” to the test by item
selection procedures similar to those followed in the construction of the Meier
Art Judgment Test, but no other evidence of validity is presented. The Rigg
test has been criticized by teachers of English because the excerpts used
are so short that judgments tend to be based on minor details and mechanical
aspects of the poetry.



Out of the hills of Habersham, 
Down the valleys of Hall, 

I hurry amain to reach the plain, 
Run the rapid and leap the fall. 


My source is in the Habersham Hills, 
Thence down the valleys of Hall, 

I flow so fast o'er rocks and rills 

To reach at last the waterfall. 


The subject indicates which selection he regards as the better poetry. 


Fig. 91. Sample Item from Rigg Poetry Judgment Test. (Reproduced by permission 
of Melvin C. Rigg.) 
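
The rise in reliability reported for the Rigg test when both forms are
combined is about what the Spearman-Brown formula would lead one to expect
when a test is doubled in length. The sketch below is offered only as a check
on that arithmetic; the text does not state how the combined-form figures were
actually obtained, and the function name is hypothetical.

```python
def spearman_brown(r, k=2.0):
    """Estimated reliability of a test lengthened k times.

    r: reliability of the original form (e.g., a parallel-form correlation).
    k: lengthening factor; k=2 corresponds to combining two parallel forms.
    Assumes the added material is parallel to the original form.
    """
    if not -1.0 < r <= 1.0:
        raise ValueError("r must be a correlation greater than -1 and at most 1")
    return k * r / (1 + (k - 1) * r)


# Single-form reliabilities across the .70's, as reported for the Rigg test.
for single_form in (0.70, 0.75, 0.79):
    combined = spearman_brown(single_form)
    print(f"{single_form:.2f} -> {combined:.2f}")
```

Under this assumption, single-form values anywhere in the .70's yield
estimates in the .80's for the combined forms, in line with the figures
reported.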


A similar technique has been applied by Carroll (7) to the appreciation
of prose. Each item in the Carroll Prose Appreciation Test contains four short
passages, approximately equated in length and subject matter, but taken
from four sharply differentiated sources. Unlike the Abbott-Trabue test,
however, only one version was specially prepared for testing purposes. The
four sources of test materials include top-ranking classic literature, books
generally regarded as being of mediocre or poor quality, stories from pulp
magazines, and a deliberate mutilation written specially for the test. An
illustrative item will be found in Figure 92. In this test, the subject is
required to rank the four passages for literary merit, giving a rank of 1 to the
best and a rank of 4 to the poorest in each set. The score is based on a system
of partial credits.


Three levels of the Carroll test were prepared, for use with junior high
school, senior high school, and college students. Percentile norms are given
separately for each grade. Reliability coefficients of the order of .70 were
obtained by both split-half and retest techniques. As in many tests of artistic
appreciation, validity was considered in terms of [...] opinion, and
performance of contrasted [...]. Significant differences in mean scores were
found between [...] college groups, as well as between college and adult [...].
In addition, the scores increased regularly [...] school group.

It has been objected that in this test, [...] passages are too brief to permit
the [...] As [...] in terms of their stylistic characteristics. Broader
questions of literary criticism, such as portrayal of characters, or principles
of organization employed in developing a plot, are obviously excluded. The
passages are probably long enough, however, to permit judgment of many
important aspects of writing.


AN INTERIOR

A

I went with the little maid into a gorgeously decorated bedroom, all of cream
color and light blue that blended prettily. The bed was a great, wide affair of
beautifully carved and ornamented wood, painted creamy white with blue and gold
trimmings. There was a wonderful bureau and a dressing table to match, and in
one corner of the room a mirror that went from floor to ceiling. I had to hold my
breath.

B

Lollie had never seen such a pretty room, and it made her gasp to see how
pretty the furniture was, as well as how pretty the rugs were, and the curtains at
the windows and the pictures on the wall, but what she really liked best was that
furniture, for it looked comfortable as well as pretty, and she knew it must have
cost hundreds and hundreds of dollars. She wished she could live and die in that
one room, it was so pretty.

C

An air of Sabbath had descended on the room. The sun shone brightly
through the window, spreading a golden lustre over the white walls; only along
the north wall, where the bed stood, a half shadow lingered . . . The table had
been spread with a white cover; upon it lay the open hymn book, with the page
turned down. Beside the hymn book stood a bowl of water; beside that lay a piece
of white cloth . . . Kjersti was tending the stove, piling the wood in diligently . . .
Sorine sat in the corner, crooning over a tiny bundle; out of the bundle at
intervals came faint, wheezy chirrups, like the sounds that rise from a nest of
young birds.

D

Major Prime had the west sitting-room. It was lined with low bookcases, full
of old, old books. There was a fire-place, a winged chair, a broad couch, a big desk
of dark seasoned mahogany, and over the mantel a steel engraving of Robert E.
Lee. The low windows at the back looked out upon the wooded green of the
ascending hill; at the front was a porch which gave a view of the valley.


The subject ranks the four selections in order of literary merit. In this item,
the correct order is as follows: 1-C, 2-D, 3-A, 4-B


Fig. 92. Sample Item from Carroll Prose Appreciation Test. (Reproduced by
permission of Educational Test Bureau, Minneapolis.)

A test for the appreciation of poetry, developed in England, illustrates still
another variation of the same basic procedure (17). In this test, excerpts
from representative poems of high literary quality are presented with some
portion omitted. Three alternative completions are provided, including the
original and two weakened variants. In each item, the subject must choose
the best completion. Analysis of the performance of 600 subjects, including
secondary school pupils as well as students, research workers, and staff
members of two large universities, indicated good differentiation with regard
to age and educational level.

From this brief review of the different types of aptitude tests in literature,
music, and graphic art, it is apparent that much remains to be done in this
field. The available tests are few in number and technically crude. At the
same time, several ingenious devices and promising approaches have been
developed that warrant further exploration.


CREATIVITY AND REASONING 


Of considerable relevance to the understanding of artistic talent is the
relatively new area of research on creativity. The large majority of
investigators in this field have been concerned primarily with creative talent
in science and engineering, but some attention has also been given to creative
achievement in the arts. Many studies dealing with reasoning (in the broad
sense of problem-solving at a complex level) overlap the area of creativity.
In the research literature, the two terms are often used to refer to
very similar activities.

In a general discussion of the problem [...] fact that creative talent is not
[...] [...]ation of creativity to ideational fluency, inductive [...] Special
attention was [...] [...]ntellectual, temperamental factors. [...]aged by a
receptive as contrasted to [...] as well as by relaxed, dispersed attention
[...] active concentration on a problem. Several studies have approached the
problem of creativity through factorial analyses of batteries of tests designed
to measure various aspects of creative talent. Some correspondences can be
discerned between the factors isolated through these different analyses,
although the agreement is far from complete (2, 6, 13, 27, 43, 44).

The most extensive factorial investigation of creativity is that conducted
by Guilford and his associates under the auspices of the Office of Naval
Research (24, 25, 26, 27, 33, 60). This project set out to explore four areas
of thinking, designated as reasoning, creativity, planning, and evaluation.
Many new types of tests were developed in the course of the study and were
administered, together with tests of previously established factors, to groups
of students and military personnel. In the factorial analyses, several familiar
factors reappeared, such as Verbal Comprehension, Numerical Facility,
Spatial Orientation, Spatial Visualization, Perceptual Speed, General Reasoning
(with heavy loading on arithmetic reasoning tests), and several memory
factors. But many new factors were identified in the course of the study.

The broader implications of Guilford's research for the nature of
intelligence and for the number and organization of intellectual factors were
considered in Chapter 13. For the present purpose, interest centers especially
on the various fluency, flexibility, and originality factors found to be most
closely associated with creativity. These factors come under the heading
of “divergent thinking,” which Guilford describes as “the kind that goes off
in different directions” (26, p. 381). Such thinking permits changes of
direction in problem-solving and leads to a diversity of answers. In contrast,
“convergent thinking” leads to a single right answer determined by the given
facts (26, p. 376). The various divergent-thinking factors identified in
Guilford's studies can be illustrated by examining some of the tests found to
be heavily loaded with each factor (26, pp. 381-390). A few of these
tests have been published and are available for distribution. Others will
appear in the near future. Until more data are gathered about these tests,
however, all should be regarded as research instruments only.

A test of word fluency (10) requires the subject to write words containing
a given letter. In this as in all fluency tests, the score is simply the total
number of acceptable responses written in the time allowed. Other measures
of this factor call for words beginning with a specified prefix or rhymes for
a given word. There is some evidence that performance on word fluency
tests is correlated with creative achievement of college students in science
and art courses (16). In one ideational fluency test (10), the subject must
name things that belong in a certain class, such as fluids that will burn. In
another, he lists different uses for a common object, such as a brick or
pencil. Associational fluency is illustrated by a test calling for all words
similar in meaning to a given word, such as “hard” (10). Words for this test
were chosen because each has a variety of meanings. Another test requires the
insertion of an adjective to complete each simile (e.g., “As ____ as a fish”).
Expressional fluency can be measured by a test of four-word combinations,
in which the subject writes four connected words, the first letters of which
are given (10). For example, if given Y____ c____ t____ d____, the subject
could write, “You can throw dice.” The subject con[...]


Among the tests having high loadings in the various flexibility factors
identified [...] Pictures, Hidden Figures, and Match Problems (26, p. 386). In
the first of these tests, the problem is to find concealed faces whose lines
form parts of larger objects in a picture. In the second, illustrated in
Figure 93, the subject must identify a simple geometric figure embedded in a
more complex figure. The Match Problems test requires the removal of a
specified number of matchsticks to leave a given number of squares or
triangles. One example from this test is reproduced in Figure 94. In all of
these tests, a good performance requires freedom from persistence of
approaches, permitting a restructuring of the given stimuli.


Fig. 93. Sample Items from the Hidden Figures Test. Which of the five simple
figures at the top is concealed in each of the item figures? Answers: 1, A; 2, D.
(From Guilford, 26, p. 386. Copyright, 1959, McGraw-Hill Book Company, Inc.)


Originality can be measured by an [...] test, in which the subject must
[...]tives, salesmen, teachers, and politicians (39). Another example is the
Consequences test (12), which provides separate scores in ideational fluency
and originality. In this test, the subject is told to list as many different
consequences of some hypothetical event as he can. For example, “What would be
the results if people no longer needed or wanted sleep?” Responses are
classified as obvious or remote according to rules given in the manual. The
number of obvious responses provides the ideational fluency score; the number
of remote responses, the originality score.
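
The two-score tally described for the Consequences test can be sketched in a
few lines. The example assumes that a scorer has already classified each
response as obvious or remote by the manual's rules, which are not reproduced
in the text; the function name, the labels, and the sample data are
hypothetical.

```python
from collections import Counter

VALID_LABELS = {"obvious", "remote"}


def score_consequences(classified_responses):
    """Tally the two scores from already-classified responses.

    classified_responses: list of labels, "obvious" or "remote", one per
    response the subject listed. Classification itself is done by a human
    scorer; this helper only counts the result.
    """
    counts = Counter(classified_responses)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unrecognized labels: {sorted(unknown)}")
    return {
        "ideational_fluency": counts["obvious"],  # number of obvious responses
        "originality": counts["remote"],          # number of remote responses
    }


# Hypothetical protocol: five responses to one item, as labeled by a scorer.
labels = ["obvious", "obvious", "remote", "obvious", "remote"]
print(score_consequences(labels))
```

Because the two scores are simple counts from the same list of responses, a
subject who writes more in the time allowed has more opportunity to earn
credit on either score.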


Fig. 94. Sample Item from the Match Problems Test. If each line is a match, can
you take away four matches in A, leaving three squares and nothing more? If the
subject works under the assumption that all squares must be of the same size, he
would be unable to reach the correct solution, shown in B. (From Guilford, 26,
p. 387. Copyright, 1959, McGraw-Hill Book Company, Inc.)


Tests of originality similar to those mentioned above have yielded promising
criterion correlations in some exploratory investigations. Correlations
between .30 and .55 have been reported between such tests and teachers’
ratings for creativity of science and art students (16), as well as ratings for
originality of military officers (4). Reference should also be made to the
Ingenuity test in the FACT battery, discussed in Chapter 13. In a series of
preliminary validation studies, scores on this test yielded concurrent validity
coefficients of .35 to .50 with criteria of originality in high school art
classes, and coefficients of .28 to .46 with similar criteria in high school
English classes (20, pp. 49-51).

One investigation was concerned with the effects of varying time limits on
performance in the open-end, free-response type of test used to measure
creativity (11). Although simple recall tasks show a decreasing production
rate with time, the more inventive or creative tasks show a relatively
constant rate of production within the time limits investigated. Qualitatively,
both uncommonness and remoteness of response increase with time. With longer
time limits, subjects gave a larger proportion of responses rated high in
these aspects of originality.

Although tests of fluency, flexibility, and originality such as those cited
above probably come closest to measuring the essential aspects of creativity,
other abilities are undoubtedly needed for effective creative achievement,
especially in the sciences. A number of cognitive and evaluative aptitudes
usually classified under reasoning are clearly relevant. Two available tests in
this area that have emerged from the Guilford project are Logical Reasoning
(29), composed of items in syllogistic form, and the Ship Destination
Test (9), which is heavily loaded with the general reasoning factor and
utilizes problems similar to those found in arithmetic reasoning tests. An
earlier test designed to tap several aspects of effective reasoning is the
Watson-Glaser Critical Thinking Appraisal (59). Designed for high school and
college levels, this test contains five parts, dealing with inference,
recognition of assumptions, deduction, interpretation, and evaluation of
arguments.

In discussing creative productivity in the arts, Guilford (25) suggests that
several of the fluency, flexibility, and originality factors so far identified
may play an important part. Those in the verbal area, with which many of the
available tests are concerned, are probably related to creative writing.
Corresponding factors pertaining to visual, auditory, or even kinaesthetic
“figural” content, many of which have not yet been identified, may play
an equally important part in the graphic arts, music, and choreography. In
addition, creative productivity in the arts, as in the sciences, undoubtedly
requires also a certain minimum proficiency in relevant comprehension and
memory factors, such as verbal comprehension, spatial orientation, visual
or auditory memory, and the like.

It is too early to know what will be the final outcome of current research
on the nature of creativity. One point appears to be fairly clear at this
[...], however. Investigations of scientific talent are becoming increasingly
[...] has shifted from the individual [...]ical thinker to the one who also
[...]tic production, is coming more and [...]ment as well. It is also likely
[...]


CHAPTER 16


Achievement Tests: General 


In sheer numbers, achievement tests surpass all other types of standardized
tests. The principal object of achievement tests is to appraise the effects
of a course of instruction or training. Although such tests find their most
extensive application in education, they are not restricted to school work.
Achievement tests have also been developed for measuring the results of
specialized vocational training and experience in many types of jobs.

It is customary to contrast achievement tests with aptitude tests, the latter
including general intelligence tests, multiple aptitude batteries, and special
aptitude tests. From one point of view, the difference between achievement
and aptitude testing is a difference in the degree of uniformity of relevant
antecedent experience. Thus achievement tests measure the effects of
relatively standardized sets of experiences, such as a course in elementary
French, solid geometry, or Gregg shorthand. In contrast, aptitude test
performance reflects the cumulative influence of a multiplicity of experiences
in daily living. We might say that aptitude tests measure the effects of
learning under relatively uncontrolled and unknown conditions, while
achievement tests measure the effects of learning that occurred under partially
known and controlled conditions.


[...] achievement tests, as contrasted to those followed in
validating aptitude tests.

It should be recognized, however, that no distinction between aptitude and
achievement tests can be rigidly applied. Some aptitude tests may depend
upon fairly specific and uniform prior learning, while some achievement
tests cover relatively broad and unstandardized educational experiences.
Similarly, any achievement test may be used as a predictor of future learn-
ing. As such it serves the same purpose as an aptitude test. For example, the
progress a pupil has made in arithmetic, as determined by his present
achievement test score, may be employed to predict his subsequent success in
algebra. Achievement tests on premedical courses can serve as predictors of
performance in medical school. Whenever different individuals have had the
same or closely similar courses of study, achievement tests based on such
courses may provide efficient indices of future performance.

In differentiating between aptitude and achievement tests, we should espe-
cially guard against the naive assumption that achievement tests measure
the effects of learning, while aptitude tests measure "innate capacity" in-
dependent of learning. This misconception was fairly prevalent in the early
days of psychological testing, but has been largely corrected in the subse-
quent clarification of psychometric concepts. It should be obvious that all
psychological tests measure the individual's current behavior, which inevi-
tably reflects the influence of prior learning. The fact that every test score has
a "past" does not, however, preclude its having a "future." While revealing
the effects of past learning, test scores may, under certain conditions, serve
as predictors of future learning.


USES AND MISUSES OF ACHIEVEMENT TESTS 


Achievement tests are currently employed in education, business and in-
dustry, civil service, and the armed forces. They also constitute a part of the
armamentarium of the counselor and the clinical psychologist. In all these
fields, they may serve a variety of functions.

Uses. Achievement tests are frequently employed to check the attainment
of minimum performance standards. Is the industrial or military trainee
ready for a specific job assignment? Is the applicant qualified for a license
to drive a car, pilot a plane, or practice medicine? This application of
achievement tests represents an all-or-none appraisal of current status.

Selection is another function for which achievement tests are often em-
ployed. Such tests play a major role in the hiring of applicants for a wide
variety of specialized industrial jobs. In civil service employment procedures,
the development and application of many kinds of achievement tests repre-
sent a gigantic undertaking. The periodic administration of many thousands
of educational achievement tests in connection with the admission of students
to colleges, graduate schools, and professional schools is well known. Certain
scholarship programs likewise include competitive achievement examinations
as one instrument for the selection of the most promising candidates.

Placement and classification represent other types of decisions utilizing
achievement tests. In this connection, the uses of achievement tests range
from the classification of military personnel in terms of previous job training
and experience, to the "sectioning" or ability grouping of elementary school
children. The practice of subdividing school classes into relatively homoge-
neous ability sections has been followed for several decades. Sectioning on the
basis of over-all educational achievement or aptitude, however, has been
criticized on the grounds that it ignores trait differences within the individ-
ual. For this reason, the use of achievement tests in different areas, as a
means of adapting instruction to individual ability patterns, has been strongly
advocated (cf., e.g., 35, Ch. 15).

Achievement tests are an important tool in counseling. An appraisal of
the individual's current skills and knowledge is an obvious first step in the
educational and vocational planning that constitutes [...] the counseling
situation. Similarly, achievement tests have a place in clinical practice. In
the diagnosis of individuals with reading disabilities or other [...] the need
for achievement tests is self-evident [...] analyses of academic deficiencies.
Many other types [...]ation of achievement tests [...] and delinquency, for
example, [...] educational failures and maladjustment to the school
situation may be contributing factors. Similarly, emotional maladjustments
among intellectually gifted children are sometimes found to be associated
with improper educational placement.

The many roles that achievement tests can play within the specific setting
of the school itself have long been recognized. As an aid in the assignment
of grades, such tests have the advantages of objectivity and uniformity.
Properly constructed, they have other merits, such as adequacy of [...]
coverage and reduction of the operation of irrelevant and chance factors
[...] marking [...], the periodic administration of [...] achievement tests
serves to facilitate learning. Such [...] reveal weaknesses in past learning,
give direction to subsequent learning, and motivate the learner. The incentive
value of "knowledge of results" has been repeatedly demonstrated by
psychological experiments in many types of learning situations, with subjects
of widely varying age and education. The effectiveness of such self-checking
is generally heightened by immediacy. Thus when achievement examinations
are employed primarily as a learning aid, it is desirable for the students to
become aware of their errors as soon after taking the test as possible.

The use of achievement tests as learning devices is highlighted by the de-
velopment of “teaching machines,” cited in Chapter 3. Such machines may
use apparatus for exposing items and recording responses, or a punchboard
in which the subject punches holes on an answer sheet to indicate his choice
of response, or even simpler paper-and-pencil materials. A feature com-
mon to all teaching machines is the provision for immediate self-scoring or
“feedback.” A number of investigations, principally with college students,
have demonstrated a significant superiority in learning by groups using these
self-scoring training devices, in comparison to control groups devoting the
same amount of time to more traditional learning procedures (cf. 19, 29,
41, 43, 44, 45).

Finally, achievement tests may be employed as aids in the evaluation of
teaching, the improvement of instructional techniques, and the revision of
curriculum content. Achievement tests can provide information on the ade-
quacy with which essential content is being covered. In situations demanding
uniformity of training, as in the military services, such uniformity can be
assured by the administration of a common test. Achievement tests can like-
wise indicate how much of the course content is actually retained and for
how long. Are certain types of material retained longer than others? What
are the most common errors and misunderstandings encountered? How well
can the learners apply their knowledge to new situations? By focusing atten-
tion upon such questions and by providing concrete facts, achievement tests
stimulate an analysis of training objectives and encourage a critical exam-
ination of the content and methods of instruction. The growth of fall testing
programs points up the increasing use of test results as a basis for planning
what is to be taught to a class as a whole and what modifications and ad-
justments need to be made in individual cases. By giving tests at the be-
ginning of the school year, constructive steps can be taken to fill the major
gaps in knowledge revealed by the test results.

Misuses. The possible dangers inherent in the unwise application of achieve-
ment tests have been as vigorously expounded by educators as have the
merits of such tests. One of the strongest objections to the use of achieve-
ment tests pertains to the excessive standardization of instruction that may
thereby be encouraged. In the learning of elementary skills, as well as in a
number of vocational and military training situations, such standardization
may be a desirable goal. In many other types of learning, however, en-
forced uniformity is objectionable because it tends to stifle spontaneity,
creativity, and original thinking. Moreover, excessive standardization ignores
individual differences in both learners and instructors. The adaptation of in-
struction to local needs and conditions is also incompatible with such a high
degree of standardization.

Another objection raised against the indiscriminate use of achievement
tests centers about the dangers of test-oriented instruction. If achievement
tests were so designed as to cover all important goals of education, each
weighted in proportion to its importance, this criticism would lose much of its
force. Achievement tests, however, tend to overemphasize certain types of
learning and neglect others. Not all educational objectives are equally
amenable to standardized testing. Despite these limitations of tests, instruc-
tors and students are frequently motivated to concentrate upon those aspects
of a course that will lead to a better achievement test score. The fact that
administrators sometimes place undue emphasis upon test performance also
encourages such an attitude.

Like any other type of test, achievement tests should be regarded as tools,
not goals. Moreover, it must be remembered that they provide only partial
information and need to be supplemented by other observations. In this
respect, too, they resemble other tests. When properly used, with due regard
to their limitations, they provide an effective instrument for many purposes.
It should also be added that achievement tests are constantly being improved
in a number of ways. Thus techniques are being developed for testing
more complex and more creative aspects of learning. Although it is relatively
[...] to new situations, and similar functions, [...]
not impossible. In fact, one of the most conspicuous changes in educational
achievement tests since the 1940's has been the development of tests
measuring the attainment of broader educational goals formerly regarded as
inaccessible to objective evaluation [...] in the last [...]
examinations could be improved by the judicious application of sound test-
construction procedures. Many treatises have been written on the question of
how to construct achievement tests. Most of the techniques described in these
publications are common to the construction of any type of test. Several of
the discussions, however, have been specially oriented toward the prepara-
tion of achievement tests for use in education, civil service, or the armed
forces. Some are specifically concerned with techniques for improving in-
formal, teacher-made examinations. The major steps in constructing an effec-
tive classroom test may be summarized under three headings: (a) planning
the test, (b) item writing, and (c) item analysis.

Planning the Test. The test constructor who plunges directly into item
writing is likely to produce a lopsided test. Without an advance plan, some
areas will be overrepresented while others may remain untouched. It is
generally easier to prepare objective items on some topics than on others.
And it is easier to prepare items that require the recall of simple facts than
to devise items calling for critical evaluation, integration of different facts,
or application of principles to new situations. Yet follow-up studies have
shown that the factual details learned in a course are most likely to be for-
gotten, while the understanding of principles and their application to new
situations show either no retention loss or an actual gain with time after
completion of a course (54, 57). Thus the test constructed without a blue-
print is likely to be overloaded with relatively impermanent and less impor-
tant material. Many of the criticisms of objective tests stem from the common
overemphasis of rote memory and trivial details in poorly constructed tests.

To guard against these fortuitous imbalances and disproportions of item
coverage, test specifications should be drawn up before any items are pre-
pared. For classroom examinations, such specifications should begin with an
outline of the objectives of the course as well as of the subject matter to be
covered. In listing objectives, the test constructor should ask himself what
changes in behavior the course was designed to produce. Such changes may
pertain to attitudes, interests, interpersonal relations, and other emotional
or motivational characteristics, as well as to the acquisition of knowledge
and the development of intellectual skills.

An unusually thorough analysis of educational objectives in the cognitive
domain can be found in the Taxonomy of Educational Objectives (4).
Prepared by a group of specialists in educational measurement, this hand-
book also provides examples of many types of items to illustrate the test-
ing of each objective. The major categories in this taxonomy include knowl-
edge (in the sense of remembered facts, terms, methods, principles, etc.),
comprehension, application, analysis, synthesis, and evaluation. Each broad
objective is broken down into finer and finer subdivisions and [...]
from a variety of fields. Some interesting similarities can be noted between
this taxonomy of educational objectives and the schema developed by
Guilford for classifying the abilities identified through factor analysis (cf.
Ch. 13). But the parallel is by no means complete.

The specifications drawn up in planning a classroom test should include the
topics to be covered, the kinds of learning to be tested (in terms of objec-
tives), and the relative importance of individual topics and objectives. On
this basis, the number of items of each kind to be prepared on each topic
can be established. The most systematic way of setting up such specifications
is in terms of a two-way table, with objectives across the top and topics in
the left-hand column. Not all cells in such a table, of course, need to have
items, since certain kinds of learning may be unsuitable or irrelevant for
certain topics.
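
A two-way table of specifications of this kind can be sketched in a few lines
of Python. The sketch below is only an illustration under stated assumptions:
the function name allocate_items, the topics, the objectives, and all the
weights are hypothetical, and the rule of making each cell proportional to
the product of its topic weight and objective weight is one simple possibility
rather than a prescribed procedure.

```python
def allocate_items(topic_weights, objective_weights, total_items, excluded=()):
    """Spread total_items over a topics-by-objectives table of specifications.

    Each cell's share is proportional to topic weight x objective weight
    (an assumed rule). Cells listed in `excluded` are left empty.
    Largest-remainder rounding keeps the counts summing to total_items.
    """
    excluded = set(excluded)
    weights = {
        (topic, objective): t_w * o_w
        for topic, t_w in topic_weights.items()
        for objective, o_w in objective_weights.items()
        if (topic, objective) not in excluded
    }
    weight_sum = sum(weights.values())
    exact = {cell: total_items * w / weight_sum for cell, w in weights.items()}
    counts = {cell: int(share) for cell, share in exact.items()}
    leftover = total_items - sum(counts.values())
    # Give the remaining items to the cells with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for cell in by_remainder[:leftover]:
        counts[cell] += 1
    return counts


# Hypothetical weights for an elementary arithmetic course.
topics = {"fractions": 2, "decimals": 1, "percentages": 1}
objectives = {"knowledge": 1, "comprehension": 1, "application": 2}
table = allocate_items(topics, objectives, total_items=40,
                       excluded={("decimals", "application")})

# Objectives across the top, topics in the left-hand column.
print("topic".ljust(12) + "".join(o.rjust(15) for o in objectives))
for topic in topics:
    row = "".join(str(table.get((topic, o), 0)).rjust(15) for o in objectives)
    print(topic.ljust(12) + row)
```

The excluded cell stands for a combination judged unsuitable or irrelevant;
in practice the test constructor would adjust the resulting counts by
judgment rather than accept them mechanically.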


Item Writing. The test constructor must [...] appropriate item form for his
material. The advantages traditionally cited in [...]
As the art of item writing develops, more and more of the abilities for-
merly believed to be amenable only to essay questions are proving to be
testable by objective items. Among the chief advantages of objective items
are ease, rapidity, and objectivity of scoring. Since each objective item re-
quires much less of the examinee's time than does a typical essay question,
objective items also permit a fuller coverage of content and hence reduce
an important source of chance errors in total scores.

Among objective items, there is a choice of several specific forms, such as 
true-false, multiple-choice, completion, matching, and arrangement (in order 
of magnitude, chronological order, etc.). The content and type of learning 
to be tested would largely determine the most appropriate item form. Multi-
ple-choice items have proved to be the most widely applicable. They are
also easier to score than certain other forms, and reduce the chances of cor-
rect guessing by presenting several alternative responses.

Many practical rules for effective item writing have been formulated on
the basis of years of experience in preparing items and empirical evaluation
of responses. Anyone planning to prepare objective items would do well to
consult one of the books summarizing these suggestions, such as Bean (3),
Furst (18), Travers (51), or Wood (58). In addition, several books on the
general subject of achievement tests or on the use of tests in education con-
tain one or more chapters on item writing for classroom examinations (cf.,
e.g., 22, 25, 26, 30, 35, 39, 42, 48, 52, 53). Of considerable help, too, are
published collections of items. Gerberich (23) provides a detailed classifica-
tion of items designed for different purposes and illustrates his discussion
with over 200 items. A collection of over 13,000 items on various sciences,
suitable for college and high school levels, was assembled by Dressel and
Nelson (13). References to published item collections in special fields,
ranging from accounting and art appreciation to world history and zoology,
can be found in Furst (18) and Gerberich (23).

To add one more summary of item-writing "rules" to the many already
available would be redundant. However, a few examples will be given to
illustrate the kind of pitfalls that await the unwary item writer. Ambiguous
or unclear items are a familiar difficulty. Misunderstandings are likely to
occur because of the necessary brevity. It is very difficult to write a single
sentence that can stand alone with clarity and precision. In ordinary writing,
any obscurity in one sentence can be cleared away by the sentences that
follow. But it requires unusual skill to compose isolated sentences that can
carry the whole burden unaided. The best test of clarity under these cir-
cumstances is to have the statement read by someone else. The writer,
who knows the context within which he framed the item, may find it difficult
why they chose it. In item 5, the fault seems to lie in the wording either of the
stem or of the correct alternative, since the students who missed the item 
were uniformly distributed over the four wrong options. Item 7 is an un- 
usually difficult one, which was answered incorrectly by 15 of the High and 
all of the Low group. The slight clustering of responses on incorrect option 
3 suggests a superficial attractiveness of this option, especially for the more 
easily misled Low group. Similarly, the lack of choices of the correct response 
(option 1) by any of the Low group suggests that this alternative was 
worded so that superficially or to the uninformed it seemed wrong. Both of 
these features, of course, are desiderata of good test items. Class discussion 
might show that item 7 is a good item dealing with a point that few class 
members had actually learned. 
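
The kind of tally on which such an item analysis rests can be sketched as
follows. This is a minimal illustration, not the procedure of any particular
manual: the function name analyze_item is hypothetical, the group size of 20
and all option counts are invented for the example, and the difference in
proportion passing between High and Low groups is used as one simple index
of discrimination.

```python
def analyze_item(high_counts, low_counts, correct):
    """Summarize one multiple-choice item from High and Low group tallies.

    high_counts / low_counts map each option to the number of examinees
    in that group who chose it; `correct` is the keyed option.
    """
    n_high = sum(high_counts.values())
    n_low = sum(low_counts.values())
    right_high = high_counts.get(correct, 0)
    right_low = low_counts.get(correct, 0)
    options = sorted(set(high_counts) | set(low_counts))
    # A wrong option chosen relatively more often by the High group than by
    # the Low group deserves a second look at its wording.
    suspect = [
        o for o in options
        if o != correct
        and high_counts.get(o, 0) / n_high > low_counts.get(o, 0) / n_low
    ]
    return {
        "difficulty": (right_high + right_low) / (n_high + n_low),
        "discrimination": right_high / n_high - right_low / n_low,
        "suspect_distractors": suspect,
    }


# Hypothetical tallies for a difficult five-option item (option 1 is keyed):
# 15 of 20 in the High group and all 20 in the Low group miss it.
high = {1: 5, 2: 3, 3: 7, 4: 3, 5: 2}
low = {1: 0, 2: 4, 3: 10, 4: 3, 5: 3}
result = analyze_item(high, low, correct=1)
print(result)
```

Figures of this sort only raise questions about an item; whether the fault
lies in the item or in the teaching still has to be settled by inspection and
class discussion.
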
These four items have been selected to illustrate the types of information
that may be revealed by an item analysis, as well as the decisions to which
they may lead. It should be emphasized that items ought not to be discarded
merely on the basis of the statistical evidence. Not only may the cause of the
unusual statistical findings lie in the teaching rather than in the test, but an
item may also show negligible or negative correlation with total score because
of heterogeneity of test content. Dropping such items would "overpurify" the
test and reduce content coverage. For example, if a test contains 10 items
requiring computational skills and [...] numerical reasoning, the first [...]
problems are encountered regarding the reliability of achievement tests, the
same techniques being employed as in all other tests. In connection with 
validity and norms, however, certain special points may be noted (cf. 11, 
47). 

Validity. It will be recalled that content validity (Ch. 6) finds its principal 
application in the evaluation of achievement tests. When applied to educa- 
tional achievement tests, such validity is often called curricular validity. Es- 
sentially, this type of validation is based upon the original selection of 
items to be included in the test. In this respect it is similar to some of the 
validation procedures followed in the development of artistic appreciation 
tests (Ch. 15). The preparation of test items is preceded by a thorough and 
systematic examination of relevant course syllabi and textbooks, as well as 
by consultation with subject-matter experts. On the basis of the information 
thus gathered, test specifications are drawn up for the item writers. The re- 
sulting blueprint is similar to that described in the preceding section on 
teacher-made tests, but more elaborate. 

In judging the content validity of an achievement test, the first question is 
whether the test covers a representative sample of the curricular content. 
Does it measure the extent to which the objectives of the curriculum have 
been achieved? Is the relative weight given to different objectives and 
topics satisfactory? An equally important question pertains to the exclusion 
of irrelevant variables. Thus a valid mathematics test should not measure 
reading ability. Nor should a test of creative writing measure speed. Several 
empirical procedures for checking the possible contribution of such extra- 
neous factors were cited in Chapter 6 (cf. also 27). 

A number of supplementary statistical analyses are sometimes reported in
test manuals to provide additional information on the construct validity of an
achievement test. Scores on the test may be correlated with other achieve-
ment tests, aptitude tests, grades, and ratings within a given curricular area.
Factorial analyses of such measures will help to define the field covered by
the test in terms of the factorial composition of its scores. Both quantitative
and qualitative analyses of errors made on the test are another source of
relevant information. Grade progress in test scores is frequently investigated
as a further approach to validation. This is similar to the age progress cri-
terion used in the development of certain intelligence tests. In achievement
tests, an item is retained if the percentage of children passing it increases
from the lower to the higher grades. Items showing the largest grade incre-
ments in percentage passing are preferred. Those showing no change or ir-
regular variations are discarded. Although probably satisfactory for skill sub-
jects such as reading, arithmetic, and language usage, the grade progress
criterion may be inappropriate for some of the content areas. For example, if
American history is taught in one grade but not in the next, the percentage
of pupils who pass a particular item in American history may fail to increase,
or may even drop, in the higher of the two grades. The same objection ap-
plies to the evaluation of total scores on such tests in terms of the grade
progress criterion.
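
The grade progress criterion for single items can be illustrated with a short
Python sketch. The function name grade_progress and the percentages are
hypothetical, and the rule adopted here (retain an item only if the percentage
passing rises at every step from one grade to the next) is one strict reading
of the criterion, chosen for the sake of the example.

```python
def grade_progress(percent_passing):
    """Apply a strict grade progress rule to one item.

    percent_passing lists the percentage of pupils passing the item in
    successive grades, lowest grade first. Returns (retain, increment),
    where increment is the gain from the lowest to the highest grade.
    """
    steps = [later - earlier
             for earlier, later in zip(percent_passing, percent_passing[1:])]
    retain = all(step > 0 for step in steps)
    return retain, percent_passing[-1] - percent_passing[0]


# Hypothetical percentages passing in four successive grades.
items = {
    "arithmetic item": [30, 48, 65, 80],  # steady rise: retained
    "history item": [25, 60, 55, 58],     # taught in one grade only: irregular
    "flat item": [40, 40, 40, 40],        # no change: discarded
}
for name, percents in items.items():
    print(name, grade_progress(percents))
```

The hypothetical history item shows the difficulty noted above: an item
covering content taught in a single grade would be rejected by this rule
even though it may measure that content well.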


Finally, we must bear in mind that achievement [...]
measure how well the individual has learned the content of a uniform course of
instruction. Moreover, outside reading is influenced by a number of fortui- 
tous factors, such as the individual's interests, the availability of reading 
matter, the extent of his participation in other extracurricular activities, his 
home duties, and the like. 

Within a specific course of study, the individual may show superior 
mastery of the material taught in his own grade and yet be unable to score 
at a higher grade level. On the other hand, if the test is so constructed that 
the child who has an excellent grasp of the course content in his own grade 
obtains a higher "grade equivalent" score, then such a score is misleading.
Under these conditions, for example, the child who excels in fifth-grade
history may obtain a "seventh-grade equivalent" score. This score seems to
indicate that he has mastered seventh-grade history, but it does not really 
mean that. In other words, the seventh-grade norms on such a test could be 
reached either by average mastery of seventh-grade content or by superior 
mastery of fifth-grade content. Scores obtained in this fashion are there- 
fore ambiguous. 

Moreover, the amount of improvement in a particular subject of instruc- 
tion may be much greater, for example, between grades 4 and 5 than be- 
tween grades 6 and 7. The fourth-grade child who is one grade accelerated 
on an achievement test in this field would thus excel the average more than 
the sixth-grade child who is accelerated by one grade. This difficulty is 
similar to that presented by MA units. But the use of a ratio correspond- 
ing to an IQ would be no solution, since the variations in grade units are 
more irregular, depending upon the peculiarities of the curriculum at dif-
ferent grade levels.

The picture is further complicated by the fact that, during any one grade,
progress may be relatively slow in some fields of instruction and relatively
rapid in others. The child whose achievement test scores indicate an accel-
eration of two grades in reading and in arithmetic may actually excel his
classmates much more in arithmetic than in reading. This would be true if a
larger proportion of individuals in that grade were accelerated by two grades
in reading than in arithmetic. In that case, a two-grade acceleration in reading
would not represent as much superiority as a two-grade acceleration in arith-
metic. It is apparent that an achievement profile plotted in such grade units
would be very misleading.

It should also be noted that grade norms tend to be incorrectly regarded as
performance standards. A sixth-grade teacher, for example, may assume that
all pupils in her class should fall at or close to the sixth-grade norm in
achievement tests. Such a misconception is certainly not surprising when
grade norms are used. Yet individual differences within any one grade are
such that the range of achievement test scores will inevitably extend over
several grades.

To express achievement test scores in terms of educational age norms 
offers no solution to these difficulties. Educational ages are to be interpreted 
in the same manner as mental ages, except that the former are based upon 
educational achievement test scores, rather than upon intelligence test 
scores. Educational age norms are usually found by taking the median
score of all pupils of a given age, regardless of grade placement. Since most 
children advance one grade each year, age and grade norms will correspond 
fairly closely. Discrepancies between them will be partly a reflection of pro- 
motion policies and will virtually disappear if modal-age grade groups are 
used. Regardless of how they are computed, however, educational age norms 
are subject to the same drawbacks as grade norms. 

The computation of an "educational quotient" (EQ) by dividing educa- 
tional age (EA) by CA meets with the additional type of difficulty described 
in connection with the IQ. Thus it is likely that the relationship between 
changes in EA and CA is such as to produce unequal variability of the EQ
at different ages. No uniform interpretation could therefore be attached to a 
given EQ when obtained by children of different ages. 

For the majority of testing purposes, the most satisfactory norms for the 
evaluation of achievement test performance are those showing the individ- 
ual's position within his own grade level. Percentile-within-grade norms are 
being employed increasingly for this purpose. By means of such norms, the 
individual's percentile rank is determined in reference to a normative sam- 
ple of his own grade. Age and grade norms are still widely used, how- 
ever, because of their familiarity and their apparent ease of interpretation. 
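
A percentile-within-grade norm can be sketched as below. The function name
percentile_rank and the two small normative samples are hypothetical, and
counting half of the tied scores as falling below the given score is one
common convention, assumed here for illustration.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, norm_scores):
    """Percentile rank of `score` within a normative sample.

    Counts the cases below the score plus half of those tied with it,
    expressed as a percentage of the sample.
    """
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)
    ties = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical raw-score samples for two grades on the same test.
norms = {
    5: [12, 15, 18, 20, 22, 25, 27, 30, 33, 36],
    7: [20, 24, 27, 30, 33, 35, 38, 40, 43, 46],
}
raw = 30
for grade, sample in norms.items():
    print(f"grade {grade}: percentile rank {percentile_rank(raw, sample)}")
```

The same raw score stands high in the hypothetical fifth-grade sample and
below the middle of the seventh-grade sample, which is why the normative
group must be the individual's own grade.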

A current practice followed in some of the more carefully constructed 
achievement tests involves the use of a single, composite distribution of stand- 
ard scores for all grades. One grade level is selected as a reference group, the 
average and SD of this group being used to define the point of origin of the 
scale and the size of its unit, respectively. The range is extended by supple- 
mentary scaling in other grades above and below this primary reference 
group. The single over-all distribution of standard scores thus obtained is 
then used in converting raw scores to standard-score equivalents for all 
individuals, regardless of age or grade. In plotting profiles and in recording 
the individual’s progress from year to year, such a uniform system of stand- 
ard scores is certainly preferable to age or grade norms. 

A further refinement of such uniform score scales is represented by the 
K-score proposed by Gardner (20, 21) and introduced in the 1953 edition 
of the Stanford Achievement Test (33). Although similar in principle to 
normalized standard scores, K-scores are expressed in terms of a more gen-
eralized type of curve that may exhibit varying degrees of skewness. It is 
believed that such curves permit a more accurate representation of the dis- 
tribution of academic talent within a particular school grade. 

For purposes of interpretation, however, it is customary to provide age,
grade, or percentile-within-grade norms, which are expressed in terms of the
over-all standard scores. These age, grade, or percentile norms are, of course,
subject to all the usual limitations of such norms, regardless of the fact that
they are found from standard scores rather than from raw scores. All the
previously discussed objections to grade norms would still apply. When crude
types of norms are superimposed upon the carefully developed scale of
standard scores, many of the advantages of the scale are lost.

Despite the availability of supplementary, within-grade norms, the major
emphasis in the interpretation of achievement test scores is placed upon the
individual's progress along a single, composite scale. It is interesting to note
that the traditional procedures for scaling scores are such that in "aptitude"
tests the individual is usually compared with a peer group of his own age,
while in "achievement" tests he is commonly referred to a single broad dis-
tribution within which he is expected to progress as he continues his school-
ing. The result of these dissimilar approaches to scaling is to yield achieve-
ment test scores that rise from year to year, while aptitude test scores tend
to remain constant. The deviation IQ on an "intelligence" test, for example,
will remain approximately the same when a given individual is retested an-
nually. It will be recalled that such a deviation IQ is none other than a stand-
ard score. When the same individual is retested annually with achievement
tests, however, his standard scores will show progressive "improvement," in
contrast to the "constancy" of his IQ.

Such a traditional difference in the application of standard scores tends to
perpetuate the myth that intelligence tests measure the individual's "innate,
unchanging capacity," while achievement tests measure the cumulative and
ever-changing effects of learning. The distinction is, of course, purely illusory.
It would be quite feasible to reverse the relation. Intelligence tests could be
so scaled that their deviation IQ's rose each year, and achievement tests could
be so scaled that their standard scores remained constant throughout all
ages and grades. It is simply a question of what reference group is chosen.
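
The effect of the choice of reference group can be shown with a small
numerical sketch. Everything in it is assumed for illustration: the function
name standard_score, the scale mean of 100 and SD of 15, the three tiny grade
samples, and the raw scores of a hypothetical child who stays the same
distance above the average of each successive grade.

```python
from statistics import mean, pstdev


def standard_score(raw, reference, scale_mean=100.0, scale_sd=15.0):
    """Convert a raw score to a standard score relative to a reference group.

    The reference group's mean fixes the origin of the scale and its SD
    fixes the size of the unit; 100 and 15 are assumed scale constants.
    """
    return scale_mean + scale_sd * (raw - mean(reference)) / pstdev(reference)


# Hypothetical raw-score samples for three successive grades, and one
# child's raw score on the same test in each of those grades.
grade_samples = {
    4: [10, 20, 30, 40, 50],
    5: [20, 30, 40, 50, 60],
    6: [30, 40, 50, 60, 70],
}
child_raw = {4: 40, 5: 50, 6: 60}

# Scaling against the child's own grade each year ("aptitude" style).
own_grade = [standard_score(child_raw[g], grade_samples[g]) for g in (4, 5, 6)]

# Scaling against one fixed reference grade ("achievement" style).
single_scale = [standard_score(child_raw[g], grade_samples[5]) for g in (4, 5, 6)]

print("own-grade reference:  ", [round(s, 1) for s in own_grade])
print("single reference group:", [round(s, 1) for s in single_scale])
```

With the same raw scores, the first scaling yields a constant standard score
and the second a rising one, so that "constancy" and "improvement" are here
produced entirely by the reference group chosen.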
toward broader content areas. More and more achievement batteries at all
levels have been introducing comprehensive sections dealing with the hu- 
manities, social studies, or natural science. Concurrently, achievement tests 
have been moving toward the measurement of improvable intellectual skills 
or developed abilities. Early achievement tests dealt largely with the mastery 
of factual content. In the more recent batteries, factual items have been 
supplemented—and in some instances completely replaced—by items de- 
signed to assess critical thinking, the application of principles to the solution 
of new problems, or the development of habits and attitudes conducive to 
the appreciation of art and literature. 

A characteristic feature of current achievement batteries is their emphasis
on work-study skills. These include such basic intellectual skills as reading
comprehension, ability to express one's ideas, and arithmetic computation,
as well as more specific skills needed in locating and interpreting
information. The last are illustrated by tests on the use of indexes,
dictionaries, and other reference materials, on map reading, and on the
interpretation of tables and graphs.


Several circumstances have led to these changes in the nature of achieve- 
ment tests. In part, of course, what has been happening to achievement tests 
reflects underlying changes in curriculum and teaching methods over the 
intervening period. Courses, units, or projects that cut across traditional 
subject-matter areas have become a well-known feature of the educational 
scene. Equally familiar is the decline in drill and memorization in favor of 
procedures in which the learner plays a more active role. 

A second contributing factor was a growing dissatisfaction with the 
achievement tests themselves. Traditional achievement tests, especially when 
used in major selection or admission programs, were criticized on the 
grounds that they tended to impose rigid controls on teaching and encouraged 
cramming for facts. The College Entrance Examination Board (CEEB), for 
example, has felt some concern about the prevalence not only of coaching 
schools but also of coaching courses and sessions in the regular high schools. 
Although the intensive reviews and drill sessions in preparation for such 
examinations probably have some educational value, it is generally conceded 
that the students’ time could be more profitably spent. If teachers or schools 
are judged in part on the basis of how many students pass such tests and are 
(15). What finally emerged from several years of joint effort by subject- 
matter specialists and test technicians was a six-hour battery covering the 
humanities, social studies, and science. The tests called for both knowledge 
and intellectual skills, with major emphasis on the latter. 

From an operational standpoint, it was eventually decided not to incor- 
porate these particular tests in the CEEB testing program, chiefly because 
the predictive validity of the Scholastic Aptitude Test (SAT) in combination 
with existing Board achievement tests in specific fields proved to be as good 
as that of the TDA for some purposes and somewhat better for others (40). 
The flexibility provided by different combinations of achievement tests is an 
advantage in predicting performance in different curricula or subjects. More- 
over, the TDA are more time-consuming to prepare, administer, and score 
than Board tests already in use. It should also be borne in mind that the 
failure of TDA to surpass other available instruments as predictors of college 
performance does not preclude the possibility that TDA-type tests may serve 
better in other functions for which achievement tests are designed. 

The experience gained in constructing the TDA will undoubtedly influence 
the subsequent development of achievement tests for use in the CEEB pro- 
gram. It is anticipated that some of the most promising features of the TDA 
may be incorporated either in Board achievement tests or in the SAT. At 
a broader level, experimentation with the TDA tended to highlight and to 
define the concept of developed abilities, which underlies many current 
achievement batteries. Cutting across traditional achievement and intelli- 
gence tests, the TDA point up the basic similarity in the behavior meas- 
ured by the two types of tests. By the same token, they demonstrate that the 
abilities subsumed under “intelligence” are largely those that education
undertakes to develop. It has long been known that intelligence tests
correlate about as high with achievement tests as any two intelligence tests
correlate with each other (5). Perhaps the coining of a convenient term like
“developed abilities” will succeed in breaking down the dichotomy.

A third influence that has helped shape current achievement tests is to be
found in certain major evaluation programs conducted in the schools.
Outstanding among these are the Eight-Year Study of the Progressive Education
Association, completed in 1942 (1, 46), and the Cooperative Study of
Evaluation in General Education, completed in 1954 (12). The many ingenious
and novel tests developed for use in these surveys include instruments
designed to measure critical thinking, application of scientific principles to
new situations, interpretation of literature, and other broad educational
objectives. Additional impetus to the development of the new-type achievement
tests was provided by the publication in 1946 of the forty-fifth Yearbook of
the National Society for the Study of Education, entitled The Measurement of
Understanding (38). Covering the appraisal of understanding in a variety of
subject-matter areas at both elementary and high school levels, this Yearbook
is an excellent source of sample items designed to test different educational
objectives.

From a different angle, further momentum was furnished by the testing 
program conducted under the auspices of the United States Armed Forces 
Institute to aid in evaluating the educational status of veterans. For this 
purpose, the Tests of General Educational Development (GED) were de- 
veloped at both high school and college levels (2, 55, 56). These tests were 
originally designed to help schools and colleges determine the amount of 
academic credit that should be granted students for their educational ex- 
periences while in military service. As a result of his performance on such 
tests, an adult who has been out of school for some time may be admitted 
to college without formal completion of a high school course, or he may be 
certified as having the equivalent of a high school diploma. At the high 
school level, the GED battery comprises tests of correctness and effectiveness 
of expression, general mathematical ability, and interpretation of reading 
materials in social studies, natural sciences, and literature. In these three 
major content areas, no factual knowledge is directly required, the student 
being tested only for his reading comprehension of passages chosen from 
textbooks and other appropriate sources. 

Another probable influence affecting achievement test development stems 
from recent interest in the identification of high-level talent in science and 
engineering. Such interest has focused attention on the desirability of meas- 
uring creativity, reasoning, and critical thinking in achievement tests as well 
as in aptitude tests. In fact, several of the aptitude tests discussed in
Chapter 15, such as the Watson-Glaser Critical Thinking Appraisal and some of
the Guilford tests, could be classified in the present chapter as well. With
the recognition that creativity, critical thinking, and similar talents can be
stimulated, cultivated
fact susceptible to educational development. As a result, achievement tests
are appearing in many areas once reserved for aptitude tests.


GENERAL ACHIEVEMENT BATTERIES 


A number of batteries have been developed for measuring the individual's 
general educational achievement in the areas most commonly covered by 
academic curricula. This type of test can be used from the primary grades 
to the adult level, although its major application has been in the elementary 
school. Most batteries provide individual profiles of subtest scores, in addition 
to a total score on the entire battery. An advantage of such batteries as 
against independently constructed achievement tests is that they may permit 
horizontal or vertical comparisons, or both. Thus an individual's relative 
standing in different subject-matter areas or educational skills can be com- 
pared in terms of a uniform normative sample. Or the child's progress from 
grade to grade can be reported in terms of a single score scale. The test 
user should check whether a particular battery was so standardized as to 
yield either or both kinds of comparability. 

Among the most widely used series of achievement tests for the elementary
school level are the Metropolitan Achievement Tests (14). In their 1959
revision, these tests include five batteries, ranging from grades 1 to 9. Each
battery is available in either three or four equivalent forms. Designed
primarily as measures of power rather than speed, each battery requires from
about two to four-and-a-half hours distributed over four or five testing
sessions. All tests within any one battery are printed in a single booklet. At
the upper levels, however, partial batteries are also available.

The composition of the Metropolitan Achievement Tests is summarized in
Table 23, which lists the tests included in each battery. It will be noted that
the number of tests yielding separate scores ranges from four in the Primary
I Battery to thirteen in the Advanced Battery. The content of most of these
tests is recognizable from the titles given in Table 23. Word Knowledge
is a multiple-choice vocabulary test, which even at the lowest level calls for
ability to read words. Word Discrimination, found only at the three lowest
levels, requires the discrimination of small differences in the appearance of
words, an ability considered important in learning to read. Sample items
from the Reading and the Arithmetic Concepts and Skills tests of the
Primary I Battery are reproduced in Figure 95. The Language Study Skills tests
are concerned with the use of the dictionary and other reference sources. In
the tests of Social Studies Study Skills, the pupil reads various kinds of maps,
tables, and graphs and draws conclusions from the data therein presented.
Although containing a number of items that involve understanding and
application of knowledge, the Metropolitan tests as a whole make fairly heavy
demands upon specific factual information. This is especially true of the
Science tests.


TABLE 23. Metropolitan Achievement Tests 


                                  Battery
Test    Prim. I         Prim. II    Elem.      Intermed.   Adv.
        (2nd half of    (Grade 2)   (Grades    (Grades     (Grades
        Grade 1)                    3-4)       5-6)        7-9)

Word Knowledge e ° ° * 
Word Discrimination ° ° ° 
Reading o . . » 
Arithmetic: 

Concepts and Skills ° e 

Problem Solving and Concepts ° ° 2 

Computation ° e . 
Spelling e . ° e 
Language: 

Usage . . 

Punctuation and Capitalization . e . 

Parts of Speech and Grammar ° p 

Kinds of Sentences * 
Language Study Skills . . 
Social Studies Information ° . 
Social Studies Study Skills ° d 
Science e. . 


Raw scores on each test are first converted into normalized standard scores
with a mean of 50 and an SD of 10, which provide horizontal comparability
of all tests and all forms of a given battery, but not vertical comparability
between different batteries or grade levels. For most practical purposes, these
standard scores represent only an intermediate step in looking up the
stanines, percentile ranks, or grade equivalents for each test. The stanines
are probably the most satisfactory type of score, since they represent equal
units (see Ch. 4). It is in terms of stanines th 

plotted. Stanines and percentile r. 
half-year grade group. Gr. 
example, if a child's sco 
the average score obtain 


as nearly as possible a representative 
pulation. 
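The conversions described above (raw score to percentile rank, percentile rank to a normalized standard score with a mean of 50 and an SD of 10, and percentile rank to stanine) can be sketched as follows. This is only an illustrative sketch, not the publisher's conversion procedure: the norm sample is invented, the handling of ties and of scores falling exactly on a stanine boundary is an assumed convention, and the function names are hypothetical.

```python
from statistics import NormalDist

# Cumulative percentage cutoffs separating stanines 1 to 9
# (the nine bands hold 4, 7, 12, 17, 20, 17, 12, 7, and 4 percent).
STANINE_CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]


def percentile_rank(raw, norm_sample):
    """Percent of the norm sample below `raw`, counting ties as half."""
    below = sum(1 for x in norm_sample if x < raw)
    ties = sum(1 for x in norm_sample if x == raw)
    return 100.0 * (below + 0.5 * ties) / len(norm_sample)


def normalized_standard_score(pr, mean=50.0, sd=10.0):
    """Map a percentile rank onto a normal curve with the given mean and SD."""
    pr = min(max(pr, 0.1), 99.9)  # keep inv_cdf away from 0 and 1
    z = NormalDist().inv_cdf(pr / 100.0)
    return mean + sd * z


def stanine(pr):
    """Stanine (1 to 9) for a percentile rank; boundary rule is an assumption."""
    return 1 + sum(1 for cut in STANINE_CUTOFFS if pr >= cut)


# Hypothetical raw scores of a small single-grade norm group.
norm_sample = [12, 15, 15, 18, 20, 21, 21, 21, 24, 27]
pr = percentile_rank(21, norm_sample)
print(pr, round(normalized_standard_score(pr)), stanine(pr))
```

Because every test is referred to the same normal curve within one grade group, scores from different tests of a battery become comparable with one another (horizontal comparability), but nothing in the procedure links one grade group to the next.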
Split-half reliability coefficients of the Metropolitan tests, computed within 
single-grade groups, are in the .80's and .90’s, except for the separate parts of 
the Language tests. These part scores, however, are combined in Bia of the 
normative evaluations. Content validity is based chiefly on "curricular re- 
search" involving systematic analysis of syllabi, textbooks, and published 
statements of educational goals, from which the test specifications were pre- 
pared. Further validation utilized item analyses conducted in sad 8 
tryouts of the experimental forms. In the development of the final forms,
items were selected in terms of difficulty, discriminative value against total
test scores, and grade differentiation.


Test 3. Reading: Sentences
Put a cross in the little box beside the story that tells about the picture.

[ ] A family has come to visit at this house.
[ ] Mother opens the door for the company.
[ ] The little girl has come to visit her playmate.


Test 4. Arithmetic Concepts and Skills
Put a cross on the thing that is sold by the yard.


Fig. 95. Sample Items from Metropolitan Achievement Tests, Primary I Battery.
(Reproduced by permission of World Book Company.)


An achievement test series that is outstanding in several respects is the
Sequential Tests of Educational Progress (STEP), developed by the
Cooperative Test Division of the Educational Testing Service (8). These tests
are available in four levels, suitable for grades 4 to 6, 7 to 9, 10 to 12, and 13
to 14. At each level, there are seven tests, including multiple-choice tests
in Reading, Writing, Mathematics, Science, Social Studies, and Listening, as
well as an Essay test. Two parallel forms of each objective test and four
parallel forms of the Essay test are available for each level. For maximum
flexibility of use, all seven tests at each level are published in separate
booklets and may be obtained individually. The Essay test requires thirty-five
minutes. Each of the six objective tests requires seventy minutes and can be
administered either in a single session or in two thirty-five minute sessions.

Although the need for specific knowledge in particular fields was recognized
in constructing STEP items, major emphasis was placed upon the
application of learned skills to the solution of new problems. The tests are
concerned more with the outcomes of school learning than with its content.
Since teachers are likely to agree more closely on the objectives of instruction
than on materials and methods, tests such as these, which concentrate on
objectives, are more widely applicable. They also encourage more flexibility
with regard to the specific content taught. The STEP item writers have
displayed unusual ingenuity and skill in composing items that do in fact
measure many of the intellectual skills outlined in the previously discussed
Taxonomy of Educational Objectives.

Two examples typical of items found in the tests for Science (grades 4-6)
and for Social Studies (grades 10-12) are reproduced in Figures 96 and
97. In the Science and Mathematics tests, a problem situation is presented
and is followed by a set of related multiple-choice questions. The situations
are chosen so as to be as realistic as possible, dealing with events in the
home, at camp, on the farm, etc. A possible drawback of this type of item,
especially in the Mathematics test, is its heavy loading with verbal
comprehension. This is especially true at the lower levels, where a good deal of
irrelevant verbal content is introduced to make each “story” appealing to
children. It is not surprising, therefore, to find that at the fourth-grade level
the STEP Mathematics score correlates more highly with the verbal than with
the quantitative scores on SCAT. Even at higher levels, the correlations
with SCAT verbal scores are high enough to denote considerable overlap.

A major portion of the STEP batteries is devoted to the testing of
communication skills. This is accomplished through an essay test, an objective
test of writing ability, a reading comprehension test, and a listening
comprehension test. In each of the four forms of the Essay test, the students are
assigned a different topic, these topics having been chosen through
preliminary experimentation as the most effective. The essays are scored by the
classroom teacher, by means of a product scale technique. For each topic at
each level, five sample essays are provided, together with the ratings
assigned to each by a group of experienced teachers. With these samples as a
guide, the scorer rates each essay on a 7-point scale, giving prescribed
weights to thought quality, style, and mechanics of expression. It might
be noted that product scales were among the earliest standardized
instruments introduced in educational measurement. They are also used in certain
art tests, such as the Horn Art Aptitude Inventory described in Chapter 15.
Situation: Tom wanted to learn which of three types of 
soil—clay, sand, or loam—would be best for growing lima 
beans. He found three flowerpots, put a different type of 
soil in each pot, and planted lima beans in each. He placed 
them side by side on the window sill and gave each pot the 
same amount of water. 


[Picture: three flowerpots labeled LOAM, CLAY, and SAND]

The lima beans grew best in the loam. Why did Mr.
Jackson say Tom’s experiment was NOT a good experiment
and did NOT prove that loam was the best soil for
plant growth?

23 A The plants in one pot got more sunlight than the
     plants in the other pots.
   B The amount of soil in each pot was not the same.
   C One pot should have been placed in the dark.
   D Tom should have used three kinds of seeds.


Fig. 96. Sample Item from STEP Science Test for Grades 4 to 6. (Reproduced by
permission of Cooperative Test Division, Educational Testing Service.)

Scorer unreliability is a common problem in such product scales and has 
proved to be a persistent weakness of essay tests in general. Preliminary 
evaluation of scorer reliability for the STEP Essay tests yielded correlations 
of .50 to .77 between the ratings assigned by different readers to the same 
papers. The ratings given to each STEP Essay can be further evaluated in
terms of percentile norms established on a national sample of 5000 students
in grades 4 to 14.

In the objective Writing tests, the students are given a wide variety of
written materials ranging from letters and questionnaire replies to editorials
and stories. Most of these materials are actual specimens of student writing.
Each passage is followed by multiple-choice items covering specific ways in 
The students are provided with monthly temperature and rainfall
charts for four places, as shown below.

[Chart: monthly temperature and rainfall plotted for each of the four cities;
horizontal axis: Inches of Rainfall]

They are asked such questions as:

15 In which city would one need the greatest variety of weights of clothing?
   A I    B II    C III    D IV

18 Which of these cities are north of the equator?
   E I and III only     G All of the cities
   F II and IV only     H None of the cities

which the writing could be improved with regard to mechanics of expression 
as well as organization and effectiveness. It is interesting to note that scores 
on the objective Writing tests correlate from .61 to .70 with ratings on the 
Essay tests, while different forms of the Essay test correlate from .64 to .73 
with each other. In other words, a student's performance on one essay 
can be predicted about as well from his score on the objective writing test as 
from another essay. 

In the Reading Comprehension tests, short passages from a variety of con- 
tent areas are followed by questions designed to test such skills as simple 
comprehension, interpretation, insight into the writer's motives, and critical 
evaluation. The measurement of Listening Comprehension, or "auding," is a 
relatively new development in achievement testing. The ability to under- 
stand, interpret, and critically evaluate what one hears is coming to be rec- 
ognized as an important educational goal. In the STEP Listening tests, the 
given passages are read by the classroom teacher. The questions and response 
options are also read by the teacher, although the students have a copy of the 
response options before them. A typical passage for grades 7-9, together with 
three sample items, can be seen in Figure 98. The passages sample many 
types of listening, including directions and simple explanations, exposition, 
narration, argument and persuasion, and aesthetic material. Presentation by 
classroom teachers, chosen in favor of recordings for practical reasons, 
nevertheless introduces an uncontrolled factor. At the upper levels, a further 
limitation arises from the brevity of the passages. The high school senior and 
college student—as well as the adult outside of school—must often listen to 
lectures considerably longer than the one- to four-minute passages of this 
test.
Raw scores on each STEP test are first converted into a three-digit score 
scale which, unlike the Metropolitan standard score scale, was designed for 
vertical rather than horizontal comparability. Thus performance on any one 
test, such as Mathematics, is expressed in terms of a single scale for all 
grades, but these scores are not directly comparable from one test to
another. By reference to appropriate tables for each grade level, STEP scores
can be further transmuted into percentiles. Rather than yielding a single
percentile rank, however, the scores are expressed in the form of a percentile
band for each individual. As in the case of SCAT (see Ch. 9), these percentile
bands cover a distance of approximately one standard error of measurement
on either side of the corresponding percentile. The chances are thus roughly
2:1 that the student's true position falls within the given band. The STEP 
Student Profile for the six objective tests is similar to the SCAT profile il- 
lustrated in Figure 37 (Ch. 9). For both STEP and SCAT, there is also a 
simplified Student Report form, used in interpreting scores to the students 
The examiner reads: 

Here is the fourth selection. It is a speech 

by a student running for school office. 
A students, B students, C students, D students,
and my friends! As you know, I am running for the
office of President of the Student Council. I'd like
to tell you what I'll do if I'm elected. In the first
place, I think several students ought to sit in on
teachers' meetings. They settle too many things for
us. I don't think that the teachers always
know what's best for us. 

In the second place, I'd like to see our 
Student Council do something. Take the 
business of the candy machine, for instance. 
Just because a couple of doctors and den- 
tists don't like it doesn't mean we shouldn't 
have one. | think they are wrong. I think we 
should have one. Candy is good for us. It 
gives us energy, and I, for one, don't think 
it hurts either your teeth or your appetite. 
And if it does, so what? You save the lunch 
money and can go out on a date.

Last, you know that my opponents—and 
you'll hear from them in a minute—are two 
girls. Now, everybody says girls are smarter 
than boys. That might be true—but just be- 
cause they're smarter doesn't mean they'll 
make better officers. In fact, I think girls are 


too smart and can't always get along with 
people because of that. Maybe we need 
somebody not so smart, but that can get 


along. That's me, fellow student—vote for 
me! 


19 The speaker's principal objection to girls 
as school officers evidently is that they 

A talk too much 

B support the teacher's point-of-view 

C are too smart to get along with people 

D don't want a candy machine 


22 When the speaker used the word “opponents,”
he meant

E students from other schools 

F students running against him 

G the teachers 

H doctors and dentists 


23 Judging from his comments, how does 
the speaker feel about the opinions of 
experts? 

A He pretends that the experts agree with 
him. 

B He does not respect the experts if he disagrees
with them.

C He pretends to treat the experts with re- 
spect. 

D He follows expert advice unless he can 
prove it is wrong.


the i On given on the Student Report to 
8 on different tests may be compared 


y reliability. Although only Kuder-Richardson reliabilities were reported at
the time of publication, data on correlation of parallel forms and stability
will undoubtedly be available later. The development of STEP represents
content validation at its best. Committees of outstanding educators,
representing all levels from the elementary school to college and chosen in
consultation with national professional organizations, participated with ETS
test-construction specialists both in drawing up test specifications and in
preparing and reviewing items. Statistical analyses of preliminary forms
included the usual determination of difficulty, discriminative power, and
grade progress for individual items. The extent to which it proved feasible to
construct objective items measuring understanding rather than mere recall of
information represents the major contribution of these tests.


EXAMPLE

[Chart: shaded percentile bands for Mathematics, Science, and Social Studies]

The shaded areas for Mathematics and Social Studies overlap;
there is no important difference in standings on these two tests.
The same is true of Mathematics and Science. However, the shaded
areas for Science and Social Studies do not overlap. The student
is higher in Social Studies than in Science ability, as measured by
these tests.

Fig. 99. Explanatory Example from STEP Student Report. (Reproduced by
permission of Cooperative Test Division, Educational Testing Service.)

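The percentile-band procedure used for STEP scores, in which a band extends about one standard error of measurement on either side of the obtained score and two standings are treated as different only when their bands fail to overlap, can be illustrated with a minimal sketch. The scores, the standard error, and the norm-group mean and SD below are invented, and the norm group is assumed to be normally distributed with the same parameters for all three tests. Actual STEP bands are read from published tables, not computed in this way.

```python
from statistics import NormalDist


def percentile_band(score, sem, norm_mean, norm_sd):
    """Percentile ranks one SEM below and one SEM above the obtained score."""
    norms = NormalDist(norm_mean, norm_sd)  # assumed normal norm group
    low = 100.0 * norms.cdf(score - sem)
    high = 100.0 * norms.cdf(score + sem)
    return (low, high)


def bands_overlap(band_a, band_b):
    """True when two (low, high) percentile bands share any common range."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]


# Hypothetical converted scores for one student; SEM of 5 on every test.
mathematics = percentile_band(268, 5, norm_mean=265, norm_sd=15)
science = percentile_band(260, 5, norm_mean=265, norm_sd=15)
social_studies = percentile_band(276, 5, norm_mean=265, norm_sd=15)

print(bands_overlap(mathematics, science))
print(bands_overlap(mathematics, social_studies))
print(bands_overlap(science, social_studies))

# Proportion of a normal error distribution lying within one SEM,
# which is the basis for the "roughly 2:1" odds quoted for the band.
coverage = NormalDist().cdf(1.0) - NormalDist().cdf(-1.0)
print(round(coverage, 2))
```

The invented scores were chosen to reproduce the pattern of the Student Report example: Mathematics overlaps both of the other bands, while Science and Social Studies do not overlap each other.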
The Metropolitan Achievement Tests and STEP have been chosen to il- 
lustrate the scope, nature, similarities, and differences of some of the best 
current achievement batteries. More could be said about both the strengths 
and weaknesses of either test series. To round out the picture, the student is 
urged to consult the test manuals and the Mental Measurements Yearbooks. 
Several other well-known general achievement batteries may be men- 
tioned. One of the earliest standardized educational achievement tests is the 
Stanford Achievement Test (33), covering grades 2 to 9. First published in 
1923, this battery was revised in 1929, 1940, and 1953. Another revision 
is in progress and will probably be completed in 1963. Other batteries de- 
signed for the elementary and junior high school levels include the Iowa 
Tests of Basic Skills (37) and the SRA Achievement Series (49). The 
California Achievement Tests (50) extend from the first grade through the 
college sophomore year. Batteries designed specifically for the high school 
level include the Iowa Tests of Educational Development (36) and the Es- 
sential High School Content Battery (28). 

The Cooperative General Achievement Tests (6) were developed for use 
with high school seniors and college freshmen, while the Cooperative Gen- 
eral Culture Test (7) is intended exclusively for college use, especially at the 
sophomore level. Mention should also be made of the Area Tests (24) 
of the Graduate Record Examinations (GRE). Constructed for use from the 
sophomore year of college to graduate school, the Area Tests are restricted to 
administration in the GRE Institutional Testing Program. Covering the broad 
areas of social science, humanities, and natural science as taught at the college 
level, these tests have succeeded exceptionally well in providing items that 
assess understanding and the attainment of other complex educational
objectives. Longitudinal studies of college students taking alternate forms of the
Area Tests in their freshman, sophomore, and senior years revealed
significant gains in mean scores (34).
CHAPTER 17


Achievement Tests: Special Areas 


The preceding chapter dealt with the construction and uses of achievement
tests as a whole, as well as with currently available general achievement
batteries. In this chapter, we shall consider typical examples of achievement
tests designed for more specific purposes. Principal among these are
readiness and diagnostic tests in the areas of reading and mathematical skills,
which will be discussed in the first two sections. Educational achievement
tests in separate content areas will be considered in the third section. The
next two sections will be concerned with vocational achievement tests and
with the use of achievement tests in admitting students to professional
schools and in selecting professional personnel. In the final section, the
relation of achievement tests to aptitude tests will be re-examined and the place
of achievement tests in the continuum of ability testing will be reviewed.


READINESS TESTS 


Readiness or prognostic tests are essentially aptitude tests, since their
object is to predict how well the individual will profit from a subsequent course
of training. Since they are generally employed in an educational setting,
however, they can be more convenie 


themselves be used a 
e next more advanced 
» however, have been specifically designed 
and mathematical skills- 


have much in common with general intelligence tests for the primary grades,
discussed in Chapter 9. In the readiness tests, however, special emphasis is
placed on those abilities found to be most important in learning to read, some
attention also being given to the prerequisites of numerical thinking and to
the sensorimotor control required in learning to write. Among the specific
functions covered are visual and auditory discrimination, motor control,
verbal comprehension, vocabulary, quantitative concepts, and general
information.

As an example of readiness tests for the first grade, we may examine the
Metropolitan Readiness Tests (25). This battery, a revision of which is in
progress, currently includes the following six subtests:

1. Word Meaning: In each row of four pictures, the subject selects the one
that illustrates the word the examiner names (cf. Fig. 100).

2. Sentences: This test is similar to Test 1, except that phrases and sentences are
used instead of single words.

3. Information: The subject again marks the one picture in each row of four
that corresponds to the examiner’s oral description, but the objects are now
described in terms of use or function. E.g., “mark the one you take pictures
with.”

4. Matching: This test requires the recognition of similarities and differences
in visual material, including pictures of objects, geometric forms, numbers,
letters, and words (cf. Fig. 100).

5. Numbers: Covering a wide variety of quantitative concepts and simple
numerical operations, this test resembles closely the quantitative subtests
included in intelligence tests for the primary grades.

6. Copying: The subject copies simple geometric forms as well as numbers or
letters. This test is related to both physical development and intellectual
maturity in young children. It also reveals the tendency toward reversals in
drawing and writing shown by some children.


The Metropolitan Readiness Tests are available in two equivalent forms. 
Percentile norms are provided for reading readiness (tests 1-4), number 
readiness (test 5), and total readiness for first grade school work (tests 1-6). 
These norms were established on a nationwide sample of more than 15,000 
white public school children tested during the first month of the first grade. 
As a supplementary measure, the test manual recommends the use of a 
Draw-a-Man Test similar to that standardized by Goodenough and discussed 
in Chapter 10. The scoring, however, involves the classification of each 
drawing as a whole into one of five broad categories, rather than the assign-
ment of specific points. Median reliability coefficients, based on retests with
parallel forms over intervals of a few days, are given as .83 for reading readi-
ness, .84 for number readiness, and .89 for total readiness scores. The re-

Test 1. Word Meaning. In the first row, the subject marks the baby; in the second
row, the house.

Test 4. Matching. In each row, the subject marks the picture which is identical to
the one in the circular frame.

Fig. 100. Sample Items from the Metropolitan Readiness Tests. (Copyright by World
Book Company.)


size, weight, time, and distance,
and terms, the notion of fractio
tion. Two sample items from th

The smallest dog belongs to Jimmie. Mark the picture of the dog that
belongs to Jimmie.

Make a cross on that number which tells how many hands a man has.

Fig. 101. Sample Items from New York Test of Arithmetical Meanings, Level One.
(Reproduced by permission of World Book Company.)


centile norms are given for the beginning of the second and the beginning of 
the third grades, based on approximately 17,000 children tested at each of 
these two levels in a nationwide sample. Median split-half reliabilities are 
in the .80's. Content validity was established through curricular analysis of 
the arithmetic taught in the first two grades. 

Prognostic mathematics tests at a more advanced level are illustrated by
the Orleans Algebra Prognosis Test (41) and the Orleans Geometry Prognosis
Test (42). In both of these tests, the student is provided with simple material
to learn from algebra or geometry, and is immediately tested on what he has
learned. These tests are thus worksamples, in which the student's subsequent
course learning is predicted from his performance in the sample learning
tasks. Other prognostic tests in mathematics cover a combination of
prerequisite arithmetic skills and new learning. Some contain material similar
to that found in the numerical subtests of intelligence tests, such as number
series completions. All such prognostic tests are normally validated against
subsequent course grades and terminal achievement test scores.

Still another type of readiness test is illustrated by the Modern Language
Aptitude Test (12). Designed to assess the capacity of an English-speaking
student for learning any foreign language, this test utilizes both
paper-and-pencil and tape-recorded materials. It is suitable for high school,
college, and adult groups. Two of its subtests require the learning of orally
presented numbers and visually presented words in an artificial language. The
other three parts test the subject's sensitivity to English grammatical
structure, as well as certain word recognition skills with visual and auditory
materials. The test may also be administered in a shorter form not requiring a
tape recorder. Data on predictive validity in college and high school groups
appear promising. In an earlier experimental version, this test proved
especially effective in predicting success in intensive language training
courses conducted by the Foreign Service Institute of the Department of State,
the Air Force, and the Army Language School.


DIAGNOSTIC TESTS 


In the measurement of both reading and mathematical skills, a further
distinction is made between survey and diagnostic tests. Survey tests indicate
the general level of the individual’s achievement in reading or arithmetic.
For this purpose, they usually provide a single composite score. Among
the best examples of such survey tests are the reading and arithmetic
subtests of the general achievement batteries discussed in Chapter 16.
Diagnostic tests, on the other hand, are designed to analyze the individual’s
performance and provide information on the causes of difficulty. Such tests
typically yield several scores.

Diagnostic tests in reading vary widely in the thoroughness of analysis 
they permit and in the s 
tests yielding two or thr 
vey function, to intensi 


, such as tachistoscopes for controlling rate of
exposure of printed material, and ophthalmographs for photographing the
subject's eye movements while he reads. A few examples will be considered


ns of available reading tests can be found in


Silent Reading Tests (23). 
grades 4 to 8 and an advan 


the use of a simple index, A 




given question might be found in an index. It likewise includes a poetry 
comprehension test that uses a technique similar to that of the "directed 
reading" test. 

Another example in the same category is the Nelson-Denny Reading Test
(39), which was revised in 1960. This test is designed for use with high
school, college, and adult groups. Requiring a total working time of 30
minutes, it yields separate scores in vocabulary, reading comprehension, and
reading rate. Equivalent-form reliabilities of the three part-scores range from
.81 to .93. Norms are based on large, representative, nationwide samples of
high school and college populations.

A more intensive approach is illustrated by the series of tests developed by
the Committee on Diagnostic Reading Tests (58, 59). Applicable from
grades 7 to 13, these tests include a Survey Section for screening purposes
and a Diagnostic Battery. All tests are designed for group administration,
with the exception of the oral reading test of the Diagnostic Battery. The
Survey Section, which can be given within a single class period, yields
scores in rate of reading story-type material with satisfactory comprehension,
general vocabulary, and comprehension of textbook-type material.

The Diagnostic Battery was likewise designed for classroom administration,
although requiring more time than the Survey Section. However, the
entire battery need not be given to each individual. Those areas are chosen
in which the individual shows greatest difficulty, as determined by the results
of the Survey tests. The Diagnostic tests are available in the same three areas
covered by the Survey Section, and in one additional area, that of word
recognition skills. Specifically, the tests in the Diagnostic Battery comprise:
vocabulary, which covers technical vocabulary in grammar, literature, science,
social studies, and mathematics; comprehension of textbook material,
both when the individual reads it himself and when it is read to him; rates of
reading, including rate of reading different types of material, as well as
flexibility of rate when reading for different objectives; and word attack, both
oral and silent. The last-named test utilizes a variety of procedures to
analyze the individual's responses both to the meanings and to the sounds of
words. It also provides a checklist for qualitative observations. A bulletin
designed for teachers contains many suggestions regarding the use of test
results in planning remedial instruction.

A typical battery for intensive individual testing is the Durrell Analysis of
Reading Difficulty (17), designed for grades 1 to 6. The Durrell tests utilize
series of paragraphs graded in difficulty, a set of cards, and a simple
tachistoscope. Scores are provided for rate and comprehension of oral and silent
reading, listening comprehension, rapid word recognition, and word analysis.
Supplementary tests of written spelling and speed of handwriting are
included. For non-readers, there are measures of visual memory for word
forms, auditory analysis of word elements, letter recognition, rate of
learning words, and listening comprehension. All of the comprehension tests
actually require simple recall of details read or heard and do not call for
deeper understanding. One of the chief contributions of this battery is a
checklist based on reading errors identified in a survey of 4000 children. The
battery is better suited for qualitative than for quantitative analysis of
performance. Suggestions for remedial teaching are also provided in the manual.

Among the most common weaknesses of diagnostic reading tests are
inadequate reliabilities coupled with high intercorrelations of the subtests
from which separate scores are derived. Especially in some of the shorter
tests, these two conditions reduce the diagnostic or differential effectiveness
of the tests. Several tests have also been criticized because of the
superficiality of understanding required by the reading comprehension subtests.
The measurement of rate of reading likewise presents special problems. Rate of
reading depends upon such factors as the difficulty of the material and the
purpose for which it is being read. Any individual may thus have, not one
but many reading rates. This is more likely to be true of the experienced
reader, who adjusts his rate to the nature of the task. Furthermore, in any
one reading test, the response set established by the directions may vary
widely from one person to another. For example, when told to read a passage
carefully so as to be able to answer questions about it later, some subjects
may skim the material rapidly, while others will try to understand and
memorize specific details. In the previously mentioned Diagnostic Reading
Tests, special efforts were made to tackle this problem by providing several
different measures of reading rate. The effectiveness of this solution,
however, has not been objectively demonstrated.

In the area of mathematical skills, two of the most comprehensive diagnostic
batteries are the Compass Diagnostic Tests in Arithmetic (46) and the
Diagnostic Test for Fundamental Processes in Arithmetic (10, 11). The
former are a series of group tests suitable for use in grades 2 to 8. They
comprise 20 tests, each requiring from 18 to 60 minutes and covering different
types of arithmetic operations or problems. The groups of items within each
test permit a further detailed analysis of achievement.

The Diagnostic Test for Fundamental Processes in Arithmetic, prepared
by Buswell and John (10, 11), requires individual administration, since each
problem is solved orally by the subject. It is thus possible to observe the
work methods employed by the child in carrying out the different operations.
Errors as well as undesirable work habits are recorded on a checklist, which
includes the most common difficulties encountered in arithmetic. The
problems were specially chosen so as to elicit such difficulties when present.
This test is also applicable in grades 2 to 8. There are no time limits, no
norms, and no total score, the problems and checklist being designed for
qualitative rather than quantitative analysis of arithmetic performance.

In connection with the use of diagnostic tests, one point deserves em- 
phasis. The diagnosis of reading and arithmetic difficulties and the sub- 
sequent program of remedial teaching are the proper functions of a trained 
specialist. No battery of diagnostic tests could suffice for this purpose. The 
diagnosis and treatment of severe reading disabilities require a thorough 
clinical case study, including supplementary information on sensory capaci- 
ties and motor development, medical and health history, complete educa- 
tional history, data on home and family background, and a thorough in- 
vestigation of possible emotional difficulties. In some cases, serious reading 
retardation proves to be only one symptom of a more basic personality mal- 
adjustment. Although survey and group diagnostic tests may serve to identify 
individuals in need of further attention, the diagnosis and therapy of reading 
disabilities often represent a problem for the clinician. This may also be
true in cases of severe arithmetic disabilities.


EDUCATIONAL ACHIEVEMENT TESTS IN 
SEPARATE CONTENT AREAS 


Standardized achievement tests are available for nearly every field of 
instruction. In the elementary school, with its relatively uniform curriculum, 
general achievement batteries serve the majority of testing purposes. For 
more intensive analyses of educational disabilities, these batteries may be 
supplemented with the diagnostic tests discussed in the preceding section. At 
the high school and college levels, specialized tests covering particular courses 
of study are more common. Descriptions and evaluations of available tests in 
each field can be found in the Mental Measurements Yearbooks, as well as 
in texts dealing specifically with educational measurement (e.g., 22, 29, 40, 
45). Books are also available that provide lists of achievement tests in
certain special areas, such as business education (24), home economics (5),
and physical education (13).

Of particular interest are the coordinated series of achievement tests for
different courses of study. Three well-known examples of such series are the
tests administered in the annual testing program of the College Entrance
Examination Board, the achievement tests prepared by the Cooperative Test
Division of the Educational Testing Service, and the Evaluation and
Adjustment Series published by the World Book Company. A noteworthy feature
of these coordinated series is their provision of a single system of comparable
norms for all tests. It is thus possible to make direct comparisons among
scores obtained in different subject-matter areas.

Unlike the achievement batteries discussed in Chapter 16, however, these
coordinated test series cannot be standardized on a single normative
population. Moreover, it is likely that the normative samples available for the
various academic subjects differ appreciably in general scholastic aptitude.
For example, students who have completed two years of Latin are probably
a more highly selected group than those who have completed two years of
Spanish. And those taking an examination in advanced mathematics are
probably more highly selected than those taking an examination in American
history. If scores on all these tests were to be evaluated with reference
to the means of their respective normative samples, an individual might, for
instance, appear to be more proficient in Spanish than in Latin simply
because he was compared with a poorer normative sample in the former case.

The three series cited above employ standard score scales which are
adjusted for the differences among the normative samples utilized.
In the College Board tests, the scores on the Scholastic Aptitude Test (SAT)
provide the basis for making such adjustments (18). All raw scores of
achievement tests are converted into a standard score scale having a mean of
500 and an SD of 100 in the fixed reference group. This group consists of
the 10,651 subjects who took the SAT in 1941. The scores obtained by the
candidates taking any achievement test during subsequent years are
expressed with reference to this group. The adjustment takes into account
any differences between the present group and the fixed reference group in
the distribution of SAT scores.
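
The linear step in a scale of this kind may be sketched as follows. This is only an illustration in Python, built on assumptions: the function name `to_standard_score` and the small set of raw scores are hypothetical. It shows the conversion of a raw score to a scale with a mean of 500 and an SD of 100 in a reference group. It does not reproduce the College Board procedure for adjusting through the distribution of SAT scores.

```python
from statistics import mean, pstdev

def to_standard_score(raw, reference_raw_scores, scale_mean=500.0, scale_sd=100.0):
    """Express a raw score on a scale with the given mean and SD in the reference group."""
    ref_mean = mean(reference_raw_scores)
    ref_sd = pstdev(reference_raw_scores)  # SD of the reference group itself
    if ref_sd == 0:
        raise ValueError("reference group has no score variability")
    z = (raw - ref_mean) / ref_sd  # deviation from the reference mean in SD units
    return scale_mean + scale_sd * z

# Hypothetical raw scores of a small reference group (illustrative data only).
reference = [20, 30, 40, 50, 60]
print(to_standard_score(40, reference))  # a raw score at the reference mean maps to 500.0
```

A score one SD above the reference mean would correspondingly fall at 600 on this scale, whatever the raw-score units of the particular test.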


The College Board tests are restricted to institutional use on the part of
member colleges. The Cooperative achievement tests, on the other hand,
are available for general distribution. New forms of these tests are issued
periodically. Tests have been developed for the most commonly taught high
school courses in English, foreign languages, natural sciences, mathematics,
history, and other social studies. The Scaled Scores used in Cooperative
achievement tests are normalized standard scores, so adjusted that a score of
50 represents the average score that would be expected if an average
group of students took the particular course. In the original derivation of
these Scaled Scores, students of “average ability” were chosen on the basis
of grade placement for age, as well as scores on an intelligence test and a
general achievement battery. It should be noted, however, that beginning in
1960 new tests added to this series are no longer scored in terms of a uniform
normative scale.
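
The general idea of a normalized standard score may be sketched as follows: the raw score is first converted to a percentile rank in the norm group, and the percentile rank is then converted to the corresponding point on a normal curve. The Python sketch below rests on assumptions. The function name and the data are hypothetical, the SD of 10 is chosen only for illustration, and the additional adjustment to a group of “average ability” described above is not shown.

```python
from statistics import NormalDist

def normalized_scaled_score(raw, norm_group_scores, scale_mean=50.0, scale_sd=10.0):
    """Convert a raw score to a normalized standard score via its percentile rank."""
    n = len(norm_group_scores)
    below = sum(1 for s in norm_group_scores if s < raw)
    equal = sum(1 for s in norm_group_scores if s == raw)
    # Mid-point percentile rank, expressed as a proportion.
    proportion = (below + 0.5 * equal) / n
    # Keep the proportion strictly between 0 and 1 so the normal deviate is finite.
    proportion = min(max(proportion, 0.5 / n), 1 - 0.5 / n)
    z = NormalDist().inv_cdf(proportion)  # normal deviate for that proportion
    return scale_mean + scale_sd * z

# Hypothetical norm group with raw scores 1 through 99 (illustrative data only).
norm_group = list(range(1, 100))
print(normalized_scaled_score(50, norm_group))  # the median raw score maps to 50.0
```

Because the conversion passes through percentile ranks, the resulting scores follow an approximately normal distribution regardless of the shape of the raw-score distribution.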

The Evaluation and Adjustment Series (16), published by the World Book
Company, covers a wide variety of high school courses. Some of the tests
included in this series pertain to areas not ordinarily measured by traditional
achievement tests, such as listening comprehension and study skills. Others
deal with the traditional fields of study, such as algebra, geometry, biology,
physics, American history, world history, and literature. As in the other test
series considered above, the raw scores on all these tests are converted into
a single scale of standard scores. The normative samples for each test were
compared on the basis of their deviation IQ's on the Terman-McNemar Test
of Mental Ability. The standard scores on each test are expressed in the same
units as the Terman-McNemar IQ’s. Thus in this scale, a score of 100
indicates "average performance." The reference group in terms of which such
average performance is defined is the standardization sample of the
Terman-McNemar test.

The availability of comparable achievement test scores for different fields 
is desirable for many purposes. It is also clear that some adjustment must be 
made to allow for the differences among the examinee populations in dif- 
ferent fields. A word of caution is in order, however, regarding the interpreta- 
tion of the various systems of scaled scores described above. It cannot be 
assumed that the relative status of different groups on a highly verbal scholas- 
tic aptitude or general intelligence test is necessarily the same as their relative 
status on other tests. For example, if the candidates taking a solid geometry
test and those taking a Latin test were to obtain the same distribution of
intelligence test scores, they might still differ significantly in their
aptitudes for Latin and for geometry. In other words, we cannot assume that
the “Latin sample” would have obtained the same scores in solid geometry as
the present “solid geometry sample,” if both had pursued the same courses.
Such an assumption is not implied by the use of the above systems of scaled
scores. These systems merely express all scores in terms of a fixed standard
that can be precisely and operationally defined. But the test user with only
a superficial knowledge of how the scores were derived could easily be misled
into unwarranted interpretations. It must be borne in mind that the procedures
followed in developing these scoring systems provide comparability of a sort.
But they do not necessarily yield the identical norms that would have been
obtained if all students had been enrolled in the same courses and had been
given all the tests in each series.

One may inquire where the type of achievement test discussed in this section
fits into the total testing picture. Such tests are obviously well suited for
use as end-of-course examinations. But it is likely that they may continue to
serve other functions also. In comparison with the broader tests of
educational development discussed in Chapter 16, traditional achievement tests
which are more closely linked to specific courses measure more nearly
distinct skills and knowledge. For this reason, they are likely to yield lower
correlations with intelligence tests than have been found for broad
achievement tests. If combined with intelligence tests, therefore, the
specialized achievement tests will contribute more unique, non-overlapping
variance and may permit better prediction of subsequent outcomes. It will be
recalled that it was largely for this reason that the College Board decided to
retain the combination of SAT and special achievement tests in preference to
the newer experimental Tests of Developed Abilities.
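
The effect of overlap between two predictors on their combined prediction can be illustrated with the standard formula for the multiple correlation of a criterion with two predictors. The Python sketch below uses a hypothetical function name and hypothetical validity coefficients of .50 for each test. The figures are assumptions chosen for illustration and are not drawn from any of the tests discussed.

```python
from math import sqrt

def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    if abs(r_12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return sqrt(r_squared)

# Same hypothetical validities; only the correlation between the two tests differs.
overlapping = multiple_r(0.50, 0.50, 0.80)  # achievement test highly correlated with intelligence test
distinct = multiple_r(0.50, 0.50, 0.30)     # achievement test more nearly independent of it
print(round(overlapping, 3), round(distinct, 3))
```

With equal validities, the lower the correlation between the two tests, the higher the multiple correlation with the criterion, which is the sense in which the more distinct test contributes more unique variance.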


VOCATIONAL ACHIEVEMENT TESTS 


So far we have been considering only the uses of achievement tests in
education. But we must not lose sight of the fact that a large number of
achievement tests are utilized for selection and classification purposes in
industry, government, and the armed services. When designed for industrial use,
they are commonly designated as trade tests. Many vocational achievement
tests are custom-made for specific purposes and are not available for general
distribution. Civil service examinations constitute a major example, and
other illustrations can readily be found in industry and in the different
branches of the armed forces.


Vocational achievement tests utilize a variety of testing media. The test
content may be entirely verbal, or it may involve the use of diagrammatic or
pictorial material. Questions may be presented orally or in writing. For many
testing purposes, paper-and-pencil content may be replaced by manual or
other performance tasks to be executed by the subject.


ples are sometimes presented in the 


tests. An illustration of 2 


d in Figure 102. 
Miniature tests may offer a number of advantages, such as the use of less
cumbersome and more easily duplicated equipment, greater condensation
of important work processes within a short testing period, and elimination of
risk. On the other hand, the reactions called for in miniature tests may be
quite unlike those required on the job, despite superficial resemblance of the
operations. For example, a different set of movements may be involved, as
when hand movements are substituted for arm movements. The miniature
task may also represent a highly artificial situation and may arouse different
attitudes and emotional reactions than the job itself. When miniature
worksamples are employed, it is especially important to check their empirical
validity by careful comparison of test scores with on-the-job performance.


Fig. 102. Miniature Punch Press Test. (From Tiffin and Greenly, 57, p. 451;
reproduced by permission of American Psychological Association.)

The scoring of worksamples may be based on either the process, the
product, or both. The nature of the task may determine which aspect is scored.
For example, piloting a plane, driving a car, or singing an aria in a vocal
audition must be appraised in terms of the process. On the other hand, the
ability to prepare effective advertising copy would usually be evaluated in
terms of the end product, the process being of relatively little interest. In
many tasks, both process and product are amenable to observation, but the
product lends itself more readily to objective scoring.
Process, or performance, scoring may be facilitated and standardized by
the use of checklists indicating the points to observe and the relative
importance of each. Such checklists are commonly employed in administering
road tests for the driver's license, in rating the performance of student pilots,
and in many similar types of activities. Motion-picture records may be helpful
in some situations. For example, in evaluating the performance of flexible
gunners in bomber planes, a gun camera was used by the Air Force (56, pp.
40-41). This camera was attached to combat-like equipment employed in
training flights. A permanent record was thus obtained of the points at which
the gun was aimed.


Products of worksample tests can often be scored in terms of objective
records. In a typewriting test, for example, the total number of errors in the
typewritten copy may be readily counted. Patterns and gauges can be applied
to determine whether a mechanical product falls within specified tolerance
limits, or how far it deviates from a perfect specimen. Certain types of
products, however, require qualitative evaluation by an expert. Rating scales
and checklists may be used as aids in such judgments. In other cases, product
scales similar to those described for rating drawings or essays are employed,
and the product is matched with the scale specimen it resembles most closely
and is rated accordingly.

Among the best-known vocational achievement tests available for general
use are those designed for clerical jobs, especially typewriting, stenography,
and bookkeeping (cf. 7). Some of these tests include parts dealing
with English usage and general business information, together with measures
of the primary skills under consideration. Stenographic tests usually employ
phonographic recording to assure uniformity of dictation. A typical
illustration is the Seashore-Bennett Stenographic Proficiency Test (48). This
test requires the examinee to take stenographic notes for five letters of
increasing length and dictated at increasing rates of speed. The notes [...]
letters. A somewhat different [...] Typing Adaptability Test (60). In this
test, the examinee types a corrected copy of a letter containing longhand
corrections, copies material involving dates and costs on a printed form, and
rearranges names and addresses and types them in alphabetical order.

Another technique for appraising vocational training and experience
involves the use of oral trade tests. These tests consist of short series of
questions about specialized trade knowledge. The items of information are so
chosen as to be fairly easy for anyone who has actually worked in a particular
type of job, but rarely familiar to other persons. Such questions are often
used as interview aids by placement counselors in employment offices. They
were also extensively utilized for the rapid classification of military personnel
in both World Wars.

Drawing upon oral trade tests previously developed for both military
and civilian purposes, the United States Employment Service carried out a
thorough restandardization and extension of this technique (cf. 51, Ch. 3 and
pp. 156-162). In this program, oral trade questions were formulated for a
total of 126 jobs. Each set consists of 15 questions, parallel forms also
being available for many of the jobs. Parallel-form reliability for such jobs
ranged from .79 to .93. A sample item from the bricklayer's test is given
below (51, p. 45):


Question: What do you mean by building up a lead (leed)?
Answer: Building up a section (corner) of wall.


In the development of these oral trade tests, questions were first formulated 
from information gathered through direct observation of jobs and consulta- 
tion with foremen and highly skilled workers in each field. Preliminary try- 
outs with foremen and skilled workers in each type of job led to the elimi- 
nation, revision, or addition of questions. Some questions were discarded 
because of regional differences in job practices or materials. 

The final validation was conducted on workers classified into three
categories:


A. Experts, rated as superior workers by their supervisors and usually having a
minimum of four years of experience in the given job.

B. Beginners, including apprentices, helpers, and other workers not considered
to be thoroughly skilled in their occupation.

C. Persons in related occupations who work in close proximity to the experts
in each field. For example, related groups tested with the oral trade tests for
painters included carpenters, paper hangers, glaziers, plasterers, and sheet
metal workers.


The questions for each occupation were validated on 50 to 100 persons in
group A and on 25 to 50 persons in groups B and C, respectively. Questions
were chosen on the basis of the size and significance of the difference
between percentages of persons answering the question correctly in each group.
The distributions of scores obtained by groups A, B, and C within the
bricklayer sample are shown in Figure 103. It will be noted that the groups are
sharply differentiated and that overlap is minimal, especially when group A
is compared with the other two.

Fig. 103. Distribution of Scores of Contrasted Validation Samples on Oral Trade
Questions for Bricklayers. (Data from Stead, Shartle, et al., 51, p. 41.)
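
The selection of a question by the size and significance of the difference between two groups may be sketched as follows. The Python example below is an assumption-laden illustration: the source does not state which significance test was employed, so a conventional two-proportion z test is substituted here, and the function name and the tallies are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_difference(correct_a, n_a, correct_b, n_b):
    """Return (difference, z, two-sided p) for the proportions correct in two groups."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Pooled proportion under the hypothesis of no difference between groups.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variability in responses; z is undefined")
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a - p_b, z, p_value

# Hypothetical tallies: 72 of 80 experts and 9 of 40 beginners answer correctly.
diff, z, p = proportion_difference(72, 80, 9, 40)
print(f"difference = {diff:.3f}, z = {z:.2f}, p = {p:.4f}")
```

A question showing a large and significant difference of this kind between experts and each of the other groups would be retained; one answered about equally well by all groups would be discarded.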

Because of changes in occupational processes and materials, oral trade
tests require frequent revision. It should also be emphasized that they are in
no sense a substitute for performance or worksample tests. They are intended
only as a rapid means of checking on the vocational experience claimed
by the job seeker.


TESTING IN THE PROFESSIONS 


A rapidly growing application of standardized tests is to be found in
large-scale programs for the selection of professional personnel. Many of these
programs are directed toward the selection of students for admission to
professional training. Tests are [...] candidates for schools of medicine,
dentistry, accounting, engineering, theology, and many other [...]. Although
such testing programs emphasize aptitudes and the prediction of subsequent
performance in specialized training, achievement tests on preprofessional [...]
an important part of most batteries.

It should also be noted that in the selection of students for professional
schools what is involved is not so much new types of tests as specially
administered testing programs. There is no evidence that the various
professional fields require any special aptitudes not already covered by
available tests. The typical professional school testing program includes a
test of scholastic aptitude or general intelligence, one or more achievement
tests on preprofessional training, and possibly tests of interests or other
personality traits. Test results are often supplemented with biographical data,
letters of recommendation, previous academic record, and interview ratings.

The intelligence test employed in such a program may be a previously 
available instrument such as the ACE (Ch. 9). More often it is specially de- 
signed so that the content can be slanted toward the particular professional 
field under consideration. Such a choice of content increases face validity, in 
addition to permitting better security control of test materials. There is also 
some evidence to suggest that the predictive validity of these special tests is a 
little higher than that of the intelligence or scholastic aptitude tests available 
for general use. The specialized scholastic aptitude tests often contain meas- 
ures of reading comprehension for material similar to that which the student 
will encounter in professional school. Some of the tests yield separate verbal
and quantitative scores. Spatial, mechanical, and motor aptitudes may also
be separately tested when relevant to the field.

Another level at which standardized testing programs are making major
inroads is that of specialty certification and the selection of job applicants
following completion of training. Understandably these terminal testing
programs draw much more heavily upon achievement tests in specialized
content areas; but more general types of tests are not excluded. Examples of
testing programs at this level include medical specialty board examinations in
surgery, anaesthesiology, and obstetrics and gynecology administered by
ETS; the certification of clinical, counseling, and industrial psychologists by
the American Board of Examiners in Professional Psychology (ABEPP); the
National Teacher Examinations; the NLN Graduate Nurse Qualifying
Examination administered by the National League for Nursing; the Officer
Selection and Evaluation Program of the U.S. Public Health Service; and the
Department of State Foreign Service Examinations. For illustrative purposes,
a few typical testing programs at both pretraining and posttraining
levels will be examined in the following sections. Examples will be drawn
from the fields of medicine, dentistry, law, engineering, teaching, and
graduate school studies.
Medicine. Beginning in 1930, the Association of American Medical Colleges
sponsored a testing program for selecting medical students. For many years,
the test administered for this purpose was one devised by Moss (36), which
measured principally knowledge acquired in premedical college courses and
ability to understand and retain new material similar to that taught in
medical school. Since 1948, this program has employed the Medical College
Admission Test (MCAT), originally developed by ETS (50). Requiring slightly
over 4½ hours, MCAT consists of four separately scored parts: verbal,
quantitative, understanding modern society, and science. The verbal section
includes both vocabulary and reading comprehension tests in the fields of
science, social studies, and the humanities, [...]

[...] the tests are appreciably speeded, these coefficients probably represent
overestimates. Intercorrelations of the four scores average about .60. These
high intercorrelations undoubtedly reflect the large verbal saturation of
three of the four parts. In reported follow-up studies, validity coefficients
of the separate parts against freshman rank [...] as the criterion, nearly all
cor[...] correlations were found with Medi[...]


[...] tends to make the correla[...] and followed through [...], are not
consistently [...] college grades and medical [...]. In a number of instances
the premedical grades appear to be slightly better predictors than test
scores. To be sure, the [...] are not peculiar to [...] school performance. In
all fields, [...] at least as effective as specially [...] professional school
achievement. But [...] a more valid predictor usually [...] tests are useful as
a supplement rather than as a substitute [...]

courses. An applicant from a college whose students are not highly selected
and whose grading standards are relatively low would have an advantage in 
terms of grade average. Similarly, students who had elected the minimum of 
required preprofessional courses and had filled their programs with "snap" 
courses would probably have a higher over-all grade average than those 
whose preparation was more thorough and more appropriate for their chosen 
profession. It is in such situations that a uniform admission test proves
helpful.

To evaluate the MCAT solely in terms of its predictive validity, however, 
would ignore the testing philosophy underlying its development. The MCAT 
was not specifically designed to sample the kind of learning the student
would encounter in medical school. Its object was not so much to predict 
performance in medical school courses as to assess knowledge, attitudes, 
and intellectual skills considered to be desirable in a prospective physician. 
It is on this basis that the test on understanding modern society was included. 
In 1950, Stalnaker, then director of the medical school testing project, re- 
marked: 


While I should be unwilling to discourage anyone from correlating any two
variables, I am neither impressed nor concerned when a low correlation is found
between scores on a test in understanding modern society and grades in laboratory
work in gross anatomy. I continue to favor selecting the men for the study of
medicine who have some awareness of social sciences (50, p. 50).


Such a statement is a clear expression of the viewpoint of content validation.
And it is in terms of content validity that the test was originally devised
and its use justified.

In a sense, the aim of MCAT appears to be the prediction of a more remote
criterion than medical school grades. Thus the inclusion of tests pertaining
to general cultural background and current affairs implies an expectation
that performance on these tests is related to the ultimate criterion of
effective functioning as a physician in our culture. Theoretically, such a
criterion could be defined and utilized in determining the predictive validity
of these tests. The practical difficulties presented by such a procedure,
however, are well-nigh insurmountable. Hence the tests are included on the
basis of their content validity and of the undemonstrated but generally
accepted assertion that certain characteristics are desirable in a physician.
At the same time, it would seem that at least some parts of the MCAT should be
designed to predict medical school performance. It is certainly an
understatement to say that medical school training is relevant to ultimate
achievement in medicine. To be sure, if one desires to go beyond the
prediction of medical school achievement, then those parts of the MCAT that
are incorporated
for this purpose should not be validated against medical school performance 
as a criterion. But other parts of the MCAT should be evaluated—and im- 
proved—as predictors of this training criterion. 

Dentistry. Unlike the MCAT, most tests developed for selecting applicants
to schools of dentistry have concentrated closely on the aptitudes required
in the training program. Possibly for this reason, they have usually proved to
be fairly successful predictors of course grades (6, 14, 43, 49, 54, 55).
Batteries assembled for this purpose usually include measures of mechanical
aptitude and manual dexterity, as well as tests designed to predict
achievement in academic or theoretical courses. The former abilities play an
important part in the technical and clinical training given in dental schools.
According to one survey of dental school curricula, 57 per cent of the dental
student's time is devoted to manipulative activities (47).

Early experimentation with such tests conducted at the Universities of
Iowa (49) and Minnesota (14) yielded promising results. More recently, a
Dental Aptitude Testing Program was developed under the sponsorship of
the Council on Dental Education of the American Dental Association, in
cooperation with the American Association of Dental Schools (43). Following
a five-year experimental period, the battery was adopted in 1951 by all
recognized dental schools. For the prediction of performance in technical
courses, this battery utilizes a spatial visualization test (similar to the
Minnesota Paper Form Board described in Chapter 14) and a worksample
requiring the subject to carve relatively simple geometric patterns in chalk.
The other tests include a reading comprehension [...] an achievement test
measuring factual [...] application of principles in biology [...]
Psychological Examination. The ACE was chosen [...] separate linguistic and
quantitative [...] norms based on non-dental students [...] dental school
applicants could be compared [...] populations. [...] 10 in two [...] one
dental school (61). On the other hand, [...]
caution regarding the use of national test batteries such as this without local
validation. Follow-ups of several classes at the University of Minnesota 
School of Dentistry yielded lower validities than those found in other pub- 
lished studies, as well as a different pattern of correlations between separate 
tests and the criterion of dental school grades. Because of differences in 
student populations, grading standards, curricular emphases, and other local 
conditions, any such batteries need to be validated within individual schools. 

Law. Prior to 1940, tests for the selection of law students were developed 
at a number of universities for use in their own law schools. The pioneer 
effort in this direction appears to have been made at Columbia University. 
However, the first test designed for common use in different law schools 
was the Ferson-Stoddard Law Aptitude Examination. The preparation of 
this test was begun in 1925 at the Universities of Iowa and North Carolina,
although standardization data were also obtained at several other universi- 
ties (2, 52). 

In 1943 Adams and his co-workers at the University of Iowa developed a
new legal aptitude test which was made available for general distribution
rather than being restricted to law schools (1, 2, 3). Known as the Iowa
Legal Aptitude Test, it originally included the following seven verbal
subtests: analogies, mixed relations, opposites, memory for the factual content
of a judicial opinion read two hours earlier, judging relevancy of legal
arguments, reasoning, and legal information. The last-named test was included
on the assumption that students interested in law would have learned certain
common facts of law prior to formal study of the subject. Subsequent research
with the Iowa Legal Aptitude Test revealed that three of the subtests—reasoning,
relevancy, and legal information—yielded a slightly higher multiple
correlation with law school grades than did the entire original battery (4).
Separate norms were therefore made available for the total score on these
three tests, so that they might be used as a short form.

Since 1948, the Law School Admission Test (LSAT) constructed by
ETS has been administered to law school candidates on a national basis (27,
28, 32). In about four hours of testing time, the LSAT yields a single score
based on six subtests: Principles and Cases, in which the relevance of given
principles to described cases is to be judged; Data Interpretation, designed to
measure the comprehension of quantitative data in tabular or graphic form
(see Fig. 104); Reading Comprehension for passages of general content;
Reading Memory, similar to Reading Comprehension but requiring that
questions be answered without referring back to passages; Error Recognition,
covering proficiency in the mechanics of writing; and Figure Classification, a
nonverbal reasoning test. Although the content of some of these subtests is

Directions: This section of the test consists of questions based on charts,
tables, and graphs. Each question is followed by five choices, only one of
which is correct. Whenever the option "Not answerable" appears, it is to be
understood to mean "Not answerable on the basis of the data given."
Select the correct answer to each question and mark the corresponding space on
the answer sheet.

DISTRIBUTION OF EMPLOYMENT IN NEW JERSEY BY INDUSTRY AND SEX—1940.

[Bar graph. Vertical axis: Industry. Horizontal axis: Per Cent of Total
Persons Employed (0 to 40). Industries shown: 1. Manufacturing; 2. Trade—
wholesale and retail; 3. Personal services; 4. Transportation, communication,
utilities; 5. Professional and related; 6. Finance, insurance, real estate;
7. Construction; 8. Government; 9. Agriculture; 10. All other, including those
not reported.]

Questions 21-23 are based on the graph above.

21. Which of the industrie[...]

22. Approximately how ma[...] were employed in the construction industry?
(A) 4 (B) 6 (C) 8 (D) 10 (E) Not answerable

23. Out of every 100 persons employed in the manufacturing industry,
approximately how many were women?
(A) 15 (B) 20 (C) 30 (D) 40 (E) Not answerable

ANSWERS: 21—C, 22—[...], 23—C.

Fig. 104. Sample Items from Data Interpretation Subtest of Law School Admission
Test. (From 32, p. 11; reproduced by permission of Educational Testing Service.)


[...]scribed Iowa test. Either policy c[...] Specific information tests can be
[...] to have predictive value, [...]

Follow-up validity studies have been in progress since the initiation of the
LSAT program. Available data indicate that, when combined with pre-law
grades, this test yields correlations of about .50 to .70 with the criterion of
law school grades (44). In a combined survey of 4138 students in 25 law
schools, first-year law school grades correlated .36 with undergraduate
grades, .45 with LSAT scores, and .54 with the best weighted combination
of LSAT scores and undergraduate grades.
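
The way two predictors combine into a single multiple correlation can be
illustrated with a short computation. The sketch below uses the standard
two-predictor formula for the multiple correlation. The validities of .36 and
.45 are those reported in the survey above; the correlation between
undergraduate grades and LSAT scores is not given in the text, so the values
tried for it (r_12) are purely hypothetical.

```python
import math

def multiple_r(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    numerator = r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12
    return math.sqrt(numerator / (1 - r_12**2))

# Validities from the survey: undergraduate grades .36, LSAT scores .45.
# The predictor intercorrelations tried here are hypothetical.
for r_12 in (0.0, 0.15, 0.30, 0.50):
    r_mult = multiple_r(0.36, 0.45, r_12)
    print(f"r_12 = {r_12:.2f}  ->  R = {r_mult:.2f}")
```

Under these assumptions, the lower the correlation between the two
predictors, the more each adds to the other. An intercorrelation in the
neighborhood of .15 would be consistent with the reported value of .54, but
this is an inference from rounded figures, not a result stated in the survey.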

Despite the increasing homogeneity of student populations from the first to 
the third year of law school, LSAT scores tend to correlate as highly with 
three-year grades as with first-year grades. In one university, the test yielded 
exactly the same correlation (.56) with first-year law grades and with three- 
year law grades (8). A study of 601 Yale Law School students graduating 
between 1953 and 1957 yielded a multiple correlation of .53 between a 
combination of prelaw grades and LSAT scores and the criterion of three- 
year law grades (9). This correlation is high in view of the restriction of 
range resulting from the use of LSAT in admitting candidates and in view of 
the low reliability of criterion grades. It should nevertheless be borne in mind 
that validity coefficients vary widely from one law school to another, a
finding that highlights the need for local validation.
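
The effect of restriction of range and of criterion unreliability on a
validity coefficient can be made concrete with two standard corrections: the
correction for direct selection on the predictor (often called Thorndike's
Case 2) and the correction for attenuation in the criterion. The following is
only a sketch. The ratio of standard deviations and the criterion reliability
used below are hypothetical values chosen for illustration; neither is
reported in the studies cited, and applying a single-predictor correction to a
multiple correlation is itself a simplification.

```python
import math

def correct_for_range_restriction(r: float, sd_ratio: float) -> float:
    """Estimate the correlation in the unselected group.

    r: correlation observed in the selected (restricted) group.
    sd_ratio: predictor SD in the unselected group divided by
              predictor SD in the selected group (>= 1 under selection).
    """
    return r * sd_ratio / math.sqrt(1 - r**2 + (r**2) * sd_ratio**2)

def correct_for_criterion_unreliability(r: float, criterion_reliability: float) -> float:
    """Estimate the correlation with a perfectly reliable criterion."""
    return r / math.sqrt(criterion_reliability)

observed = 0.53  # correlation reported for the Yale sample
# Hypothetical inputs: applicant SD 1.5 times the admitted-student SD,
# and a grade reliability of .80.
print(round(correct_for_range_restriction(observed, 1.5), 3))
print(round(correct_for_criterion_unreliability(observed, 0.80), 3))
```

Both corrections raise the coefficient, which is the sense in which an
observed correlation of .53 in a selected group may be regarded as high.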

Engineering. Several batteries for the selection of engineering students have
been assembled from time to time (30, 35, 54). These batteries generally
utilize previously available standard tests, including mechanical
comprehension and assembly tests, spatial visualization tests, a measure of
general scholastic aptitude such as the ACE, and achievement tests in
mathematics, science, and English. Mathematics achievement tests have usually
proved to be the best single predictor of engineering school performance.
Pre-engineering high school or college grades have high predictive validity
and are likewise employed in conjunction with test scores. English usage,
vocabulary, and reading comprehension are relevant to the understanding of
lecture and reading material encountered in engineering school, as well as to
the preparation of descriptive reports.

A test at a higher level, designed for selection of candidates for graduate
engineering training as well as for industrial jobs, is the Minnesota
Engineering Analogies Test, familiarly known as the MEAT (15). Modeled after the
Miller Analogies Test (Ch. 9), the MEAT consists of analogies items with
a heavy mathematical and scientific content. The analogies may be expressed
wholly in verbal terms, wholly in mathematical terms, or in a mixture of the
two. The content of the items is drawn chiefly from the core courses taken
by all engineering students during their first two years.

Like the Miller Analogies, the MEAT is a restricted test, administered
only at approved centers. Tentative percentile and [...] students and on
employed engineers are provided, but the [...] norms is urged. Internal
consistency reliability coefficients of [...] the two available forms range
from .75 to .85. Because of these rather low reliabilities, use of both forms
together is recommended. When administered with an interval of two days or
less, the two forms correlated from .71 to .88. Content validity was sought in
terms of current curricular coverage. Some data are available on concurrent
validity, including correlations with engineering school grades and faculty
ratings of students, as well as correlations with supervisory ratings of
employed engineers. Although varying widely in specific groups, the former
correlations cluster between .40 and .60, the latter between .25 and .35.
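
The gain in reliability to be expected from using both MEAT forms together
can be estimated with the Spearman-Brown formula for a lengthened test. The
sketch below applies it to the single-form reliabilities of .75 and .85
reported above. It assumes that the two forms are parallel, which the manual
data quoted here neither confirm nor deny.

```python
def spearman_brown(r: float, k: float = 2.0) -> float:
    """Estimated reliability of a test lengthened k times.

    r: reliability of the original test (one form).
    k: factor by which the test is lengthened (2 = both forms together).
    """
    return k * r / (1 + (k - 1) * r)

# Single-form reliabilities reported for the MEAT.
for r in (0.75, 0.85):
    print(f"one form: {r:.2f}  ->  two forms: {spearman_brown(r):.3f}")
```

On this assumption, combining the forms would raise reliabilities of .75 and
.85 to roughly .86 and .92, respectively.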
[...] of the testing program of the Grad[...]cussed in a later section. The
selection of engineers is also closely related to the general problem of
identifying high-level talent in science, a problem [...] talent in
engineering.


Teaching. Considerable research has been done on the use of tests in the
[...] teacher training courses, the most valid indicators have proved to be
previous academic grades, general intelligence tests, and academic achievement
tests. In a survey of available published [...] year of teacher training
courses. Intelligence tests yield correlations of about the same order of
magnitude. Among achievement tests, English tests are generally the most valid
predictors. Other measures of achievement frequently employed for this purpose
are reading comprehension, general culture, and contemporary affairs tests.
Several combinations of high school grades, intelligence tests, and
achievement tests have yielded correlations in the .60's with performance in
teacher training [...]

The National Teacher Examinations (38), administered annually at designated
centers throughout the country, represent a terminal achievement test
given upon completion of teacher training. Developed and conducted by
ETS, this testing program is used by school systems as an aid in selecting
teachers, as well as by teacher-training institutions as a means of evaluating
both the achievement of their students and the effectiveness of their training
programs. The National Teacher Examinations are designed to measure the
professional background, general cultural knowledge, English usage, and
intellectual level of candidates for teaching positions, as well as their
preparation in one or two chosen fields. They include a series of Common
Examinations, covering the individual's general competence for teaching, and a
series of Optional Examinations, covering mastery of subject matter in
specialized areas.

The Common Examinations consist of five tests: Professional Informa- 
tion, Social Studies-Literature-Fine Arts, Science and Mathematics, Eng- 
lish Expression, and Nonverbal Reasoning. Sample items from each of these 
tests are reproduced in Figures 105A and 105B. It will be noted that the 
Nonverbal Reasoning Test is similar in principle to the Progressive Matrices 
discussed in Chapter 10. Optional Examinations are available in many 
fields, including elementary school education, early childhood education, 
English language and literature, social studies, biological sciences, physical 
sciences, mathematics, art education, industrial arts, physical education,
business education, home economics, and music education. All items in both 
Common and Optional Examinations emphasize understanding and applica- 
tion of knowledge rather than memory for factual details. 

Scores on separate tests, as well as weighted totals on each battery, are
reported in the form of "Scaled Scores." These are standard scores with a mean
of 60, which permit direct comparisons among tests. For normative
interpretation, percentile norms are also provided on the basis of the
nationwide sample tested each year. Internal-consistency reliability
coefficients of separate tests and of weighted composites are all close to or
over .90. In the construction of the tests, validity was considered largely in
terms of content analysis and internal consistency. Item-test correlations
were computed within each separately timed subtest. Some data on concurrent
validity are also available. These include significant mean rises in scores
among different [...] employed teachers. [...] data are obviously meager, the
major emphasis being placed on content validity.

When first introduced, the National Teacher Examinations aroused
considerable controversy. With continued use, they have gained wider
acceptance. It is recognized, of course, that the tests were designed to
assess only knowledge. Teacher effectiveness depends also upon attitudes,
motivation, emotional adjustment, specific experience in teaching, and other
factors that the tests do not undertake to measure. It should be noted that
another instrument available at the same level is the Advanced Test in
Education of the Graduate Record Examinations. This test is more restricted in
its coverage, yielding only a single score on professional knowledge in the
field of education.


Professional Information

5. The chief disadvantage of constructing the school curriculum wholly on the
basis of an analysis of adult activities is that the
1—resulting curriculum would be too difficult for pupils
2—schools would be restricted to vocational training
3—resulting curriculum would be very different from the typical curriculum at
the present time
4—resulting curriculum would include only the tool subjects of reading, writing,
and simple arithmetic
5—present needs of the pupils would not be taken into consideration

History, Literature, and Fine Arts

10. A rose window would be likely to be found in a
1—Greek temple
2—baroque church
3—Gothic cathedral
4—Tudor palace
5—house by Frank Lloyd Wright

Science and Mathematics

[...]. The most important reason why apple growers frequently place beehives in
their orchards is that bees
A—eat insects which are injurious to apples
B—produce honey of superior quality when living in apple orchards
C—are conveniently raised in orchards
D—help in cross-pollinating apple blossoms
E—make honey from apple blossoms

English Expression

Which of the underlined parts of the sentence is incorrect? If the sentence
contains no error, record 0.

17. The sophomore who transferred from Tulane was the heaviest of all the other
candidates for the team.
1 2 3 4

Fig. 105A. Sample Items from Common Examinations of the National Teacher
Examinations. (Reproduced by permission of Educational Testing Service.)


Nonverbal Reasoning

Directions: Each problem in this test consists of an incomplete pattern.
The complete pattern would be made up of nine figures arranged in order.
You are to discover how the figures are related and determine the correct
figure for space IX. Try the following problems and indicate your answer
to each by blackening the corresponding space on the answer sheet.

[Figure patterns for problems 18 and 19, each followed by answer choices 1 to 4.]

Explanation: In problem 18 notice how the figures change as they go across
each row of the pattern. They become darker. As they go down, the figures
become larger. Therefore, the correct figure for space IX is large and
black. Answer choice 4 is the correct answer.

In problem 19 the figures acquire more dots as they go across the top
row. As they go down, the point of the figure is rotated a quarter of a
turn to the right. Therefore, the correct figure for space IX has three dots
and its point is directed toward the bottom of the page. Answer choice 3
is the correct answer.

Fig. 105B. Sample Items from Common Examinations of the National Teacher
Examinations. (Reproduced by permission of Educational Testing Service.)


Graduate School Studies. Standardized tests are widely used in the selection
of students for graduate study toward the master's or doctoral degrees in
[...] academic fields. The Miller Analogies Test, discussed in Chapter 9,
was designed specially for this purpose. Since it yields a single, over-all
score and is in other ways similar to general intelligence tests, it was
considered in the section devoted to that type of test, rather than in this
chapter. [...], however, graduate students have much in common with the
professional school students discussed in earlier parts of this chapter. Like
many professional school students, they are college graduates pursuing
specialized, [...] training which is often professionally oriented. A large
proportion will eventually engage in research or in teaching at the college
level or higher. It is therefore to be expected that some of the general
findings regarding professional school students will also apply to graduate
school students.

For a number of years, many graduate schools have been making use of the 
Graduate Record Examinations (20, 26, 31, 37). Known as the GRE, this 
series of tests originated in 1936 in a joint project of the Carnegie Foundation 
for the Advancement of Teaching and the graduate schools of four eastern 
universities. In 1948, the GRE project was transferred to ETS. Currently, 
the GRE are administered in two types of testing programs. The National 
Program for Graduate Student Selection is concerned with the testing of 
students at designated centers prior to their admission to graduate school. 
The test records are used by the universities for admission purposes, as well 
as for selecting recipients of scholarships, fellowships, and special appoint- 
ments. The GRE are also employed in an Institutional Testing Program, in 
which colleges and universities administer the tests to their own students. In 
this case, the test records may be utilized as aids in such functions as student 
guidance, admission of students to candidacy for a degree, and evaluation
of the effectiveness of instruction. In both programs, the tests are scored and 
retained by ETS. 

The GRE consist of an Aptitude Test, Area Tests, and Advanced Tests.
The Aptitude Test is essentially a general intelligence or scholastic aptitude
test suitable for advanced undergraduates or graduate students. Like many
such tests, it yields separate verbal and quantitative scores. The Area Tests,
providing scores in Social Science, Humanities, and Natural Science, were
discussed in Chapter 16 as an example of a general achievement battery for
the college level. Advanced Tests are available in many fields of
specialization, including Biology, Chemistry, Economics, Education, Engineering,
French, Geology, Government, History, Literature, Mathematics, Music,
Philosophy, Physics, Psychology, Scholastic Philosophy, Sociology, Spanish, and
Speech.

Scores on all GRE tests are reported in terms of a single standard score 
scale with a mean of 500 and an SD of 100. These scores are directly com- 
parable for all tests, having been "anchored" to the Aptitude Test scores of a 
reference group of 2095 seniors examined in 1952 at 11 colleges. A score 
of 500 on an Advanced Physics Test, for example, is the score expected from 
Physics majors whose Aptitude Test score equals the mean Aptitude Test 
score of the reference group. Since graduate school applicants are a selected
sample with reference to academic aptitude, the means of most groups actually
taking each Advanced Test in the Graduate [...] will be considerably above
500. Moreover, there are [...] in the intellectual caliber of students
majoring in [...]. For normative interpretation, therefore, the percentiles
given for indi[...] groups are more relevant and local norms are [...]
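
The arithmetic of a standard score scale with a mean of 500 and an SD of 100
can be shown in a few lines. The sketch below is a plain linear
transformation relative to a reference group. It is not the anchoring
procedure actually used for the GRE, which ties each Advanced Test to the
Aptitude Test scores of the reference group, and the raw-score mean and SD
in the example are hypothetical.

```python
def to_scaled_score(raw: float, ref_mean: float, ref_sd: float,
                    scale_mean: float = 500.0, scale_sd: float = 100.0) -> float:
    """Convert a raw score to a linear standard score.

    ref_mean, ref_sd: raw-score mean and SD of the reference group.
    scale_mean, scale_sd: mean and SD of the reporting scale.
    """
    z = (raw - ref_mean) / ref_sd       # distance from the mean in SD units
    return scale_mean + scale_sd * z

# Hypothetical reference group: raw-score mean 50, SD 10.
print(to_scaled_score(50, 50, 10))      # a score at the reference mean
print(to_scaled_score(60, 50, 10))      # one SD above the reference mean
```

A group whose mean falls well above 500 on such a scale is simply a group
that is more highly selected than the reference group, which is why the
percentiles for the particular group are recommended for interpretation.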

The reliability and validity of the GRE have been investigated in a number
of different student samples (20, 31). Kuder-Richardson [...] reliability
coefficients of the separate [...] after an interval of one year yielded
reliabilities of [...] years, the reliability coefficients ranged from [...]
has been checked in terms of such criteria as graduate school grades, success
versus failure in graduate school, instructors' ratings, and performance on
Ph.D. qualifying examinations. Studies conducted in a number of universities
indicate that the GRE are not appreciably superior to undergraduate
grades as predictors of graduate school performance. But when combined
with grades they permit more effective prediction than [...]
when grades alone are used. The multiple correlations found with such
combination of predictors were usually in the .60's.

It should be noted, however, that these tests are designed to serve
other functions besides the prediction of graduate school performance. As in
some of the previously described professional aptitude tests, the GRE are
employed partly as a measure of the candidate's breadth of cultural
background, verbal comprehension, quantitative reasoning, and other
qualifications considered important in the selection of graduate students. To
the extent that the tests are used for these purposes, their effectiveness is
more a matter of content validity than of predictive validity.


THE CONTINUUM OF ABILITY TESTING 


It has been repeatedly noted that the distinction between aptitude and
achievement tests is not so basic as was once supposed. The new-type,
broad achievement batteries considered in Chapter 16 were seen to bridge
the gap between traditional course-oriented achievement tests and tests of
general intelligence or scholastic aptitudes. It should now be apparent that
all ability tests fall along a continuum with regard to their dependence upon
specified prior experience. In this respect, traditional achievement tests,
broad achievement tests, and intelligence tests differ only in degree. Within
the same continuum may be placed special aptitude tests and multiple aptitude
batteries. [...] tests may occupy widely separated positions along the
continuum, from the highly verbal type of test closely dependent upon
schooling, through non-language and performance tests, to the so-called
culture-free tests.

As we look over the many kinds of ability tests that have been considered 
in Parts 2 and 3, we can see that they may all be regarded essentially as 
tests of “developed abilities." The relations among them can be seen in Figure 
106, showing the continuum along which ability tests may be ordered. At 


[Figure: a continuum divided into five categories, in order: traditional
achievement tests; new-type, broad achievement tests; verbal-type intelligence
and aptitude tests; non-language and performance tests; "culture-free" tests.
Tests placed along the continuum for illustration: Seashore-Bennett Stenog.
Prof. Test, CEEB Achiev. Test in German, Cooperative Trigonometry Test,
Teacher-made Course Examinations, Metropolitan Achiev. Tests, TOGA, STEP,
MCAT, Miller Analogies, DAT, Bennett Mech. Compreh., Meier Art-Judgment,
Stanford-Binet, WAIS, WISC, Lorge-Thorndike, Army Beta, Arthur Performance,
Davis-Eells, Progressive Matrices, Leiter.]

Fig. 106. The Continuum of Ability Tests.


one extreme are the traditional, factual, course-oriented achievement tests. 
These tests are closely dependent upon a fully specified, uniform course of 
training. As we move past the new-type achievement tests, which stress 
intellectual skills rather than course-linked factual content, we come to the 
predominantly verbal intelligence tests. Situated approximately in the center of 
the continuum, such intelligence tests characteristically draw upon experi- 
ences shared by most persons in a defined cultural group, such as middle- 
class American urban public school children. Moving on farther, we pass 
non-language (including non-reading) and performance tests and finally 
reach the "culture-free" tests which are based upon experiences common to 
many cultural groups. 

For illustrative purposes, a few tests have been placed at various points
along this continuum. It will be noted that the tests do not fall into sharply
differentiated categories. A test like the Medical College Admission Test
(MCAT), for instance, partakes of the features of both a broad achievement
test and a verbal intelligence test. Tests like the WAIS, WISC, and Lorge-
Thorndike cut across verbal and non-language or performance types, since
they have separate parts falling into each of these categories. Teacher-made
examinations cover a broad area, since they may vary from the most factual,
traditional, course-oriented test to much broader types stressing
comprehension and application of knowledge.

As a review, the student may try inserting into this continuum other tests
that have been discussed in the preceding chapters. The exact position of
some tests may prove to be a matter of controversy. A case in point is the
relative position of SCAT (Ch. 9) and STEP (Ch. 16). Superficially, SCAT
is an intelligence or scholastic aptitude test, STEP an achievement battery.
Hence SCAT should be placed to the right of STEP on the continuum. But
upon closer inspection, it appears that SCAT draws freely upon word
knowledge and arithmetic processes learned in school, while STEP has
succeeded in measuring reasoning and other intellectual skills fairly
independently of specific factual recall. It could be argued that STEP is
farther from traditional achievement tests than is SCAT. Nowhere is the
increasing rapprochement between traditional intelligence tests and
traditional achievement tests more clearly illustrated than in these two
concurrently developed test series.


Another example of this relationship is provided by the Aptitude and Area
Tests of the GRE. There is evidence that corresponding parts of the Aptitude
and Area Tests correlate more highly than the different parts of either test
correlate with each other. In one survey, the correlations between the Verbal
and Quantitative scores of the Aptitude Test ranged from .41 to .48, but
the Verbal score of the Aptitude Test correlated from .71 to .76 with the
Social Science and Humanities Area Tests, and the Quantitative score of the
Aptitude Test correlated from .61 to .64 with the Natural Science Area Test
(20). If the misleading traditional labels are discarded and we recognize
that all these tests measure developed abilities, the function served by each
test can then be identified more realistically.


Measurement of Personality Traits


CHAPTER 18 


Self-Report Inventories 


In the classification of psychological tests outlined in Chapter 2 it was
noted that tests designed for the measurement of personality characteristics
are concerned primarily with the emotional, social, and motivational aspects
of behavior. In the next four chapters, the major types of tests currently
employed for these purposes will be surveyed. This chapter will deal with
personality inventories. Chapter 19 will consider available techniques for the
measurement of interests and attitudes. The instruments to be covered in
both of these chapters are essentially paper-and-pencil, self-report
questionnaires suitable for group administration. The use of projective
techniques for the assessment of personality characteristics will be discussed
in Chapter 20. In the last chapter, we shall examine a number of
miscellaneous approaches to the measurement of personality, many of which
are still in an experimental stage.

The number of available personality tests runs into several hundred.
Especially numerous are the personality inventories and the projective
techniques. In this book, we shall be concerned primarily with the types of
approaches that have been explored. A few of the most widely known tests of
each type will be briefly described for illustrative purposes. For a more
comprehensive survey of existing instruments, the reader should consult such
sources as The Mental Measurements Yearbooks, as well as books devoted
entirely to the measurement of personality (e.g., 1, 46).

In the development of personality inventories, several approaches have
been followed in formulating, assembling, selecting, and grouping items.
Among the major procedures in current use are the formulation of items in
terms of content validity, empirical criterion keying of an item pool, factor
analysis of items or subtest scores, forced-choice arrangement of items on the
basis of social desirability, and the application of personality theory in
choosing variables and constructing items. Each of these approaches will be
discussed and illustrated in the following sections. It should be noted,
however, that they are not alternative or mutually exclusive techniques.
Theoretically, all could be combined in the development of a single
personality inventory. In actual practice, several inventories have utilized
two or more of these procedures.


CONTENT VALIDATION 


The prototype of self-report personality inventories was the Woodworth 
Personal Data Sheet (cf. 39), developed for use during World War I. This in- 
ventory was essentially an attempt to standardize a psychiatric interview and 
to adapt the procedure for mass testing. Accordingly, Woodworth gathered 
information regarding common neurotic and preneurotic symptoms from the 
psychiatric literature as well as through conferences with psychiatrists. It was 
in reference to these symptoms that the inventory questions were originally 
formulated. The questions dealt with such behavior deviations as abnormal 
fears or phobias, obsessions and compulsions, nightmares and other sleep dis- 
turbances, excessive fatigue and other psychosomatic symptoms, feelings of 
unreality, and motor disturbances such as tics and tremors. In the final selec- 
tion of items, Woodworth applied certain empirical statistical checks, to be 
discussed in the next section. Nevertheless, it is apparent that the primary 
emphasis in the construction of this inventory was placed upon content 
validation, as indicated in the sources from which items were drawn as well 
as in the common recognition of certain kinds of behavior as maladaptive. 

One of the clearest examples of content validation in a current personality
inventory is provided by the Mooney Problem Check List (63). Designed
chiefly to identify problems for group discussion or for individual
counseling, this checklist drew its items from written statements of problems
submitted by about 4000 high school students, as well as from case records,
counseling interviews, and similar sources. The checklist is available in
junior high school, high school, college, and adult forms. The problem areas
covered vary somewhat from level to level. In the high school and college
forms, they include health and physical development; finances, living
conditions, and employment; social and recreational activities;
social-psychological relations; personal-psychological relations; courtship,
sex, and marriage; home and family; morals and religion; adjustment to school
work; the future, vocational and educational; and curriculum and teaching
procedure. Although the number of items checked in each area can be recorded,
the test does not yield trait scores or measures of degree of adjustment.
Emphasis is on individual items as self-perceived and self-reported problems
or sources of difficulty.

Another checklist of needs and problems, suitable for grades 4 to 8, is
the SRA Junior Inventory (68). The areas sampled by this inventory are
designated as: About Me and My School, About Me and My Home, About
Myself, Getting along with Other People, and Things in General. A novel
feature introduced in this inventory is the use of response boxes of different
sizes to enable the child to suggest the magnitude of each problem, as
illustrated in Figure 107. A similar procedure is utilized with a slightly
different set of problem areas in the SRA Youth Inventory (69), for grades 7
to 12. Responses to both of these inventories yield a score in each area that
may be evaluated in terms of norms provided for this purpose, but such a
quantification of responses is of dubious value.


[Figure 107: two sample items, each followed by a row of response boxes of
graded size and a circle.]

I want to learn how to read better . . .
I wish I had . . .

In the Junior Inventory, Form S, pupils check each statement as a big
problem (by marking the big box), a middle-sized problem (by marking the
middle-sized box), a little problem (by marking the little box), or no
problem (by marking the circle).


Fig. 107. Sample Items from SRA Junior Inventory. (Reproduced by permission of
Science Research Associates.)

Mention may also be made of the Bell Adjustment Inventory (13), which
has continued in active use since its initial appearance in 1934. The Student
Form, designed for the rapid screening of high school and college students
for counseling purposes, yields adjustment scores in four areas: home, health,
social, and emotional. A less widely used Adult Form added a fifth score in
occupational adjustment. Items for the Bell Inventories were selected largely
from existing inventories and were grouped into the five categories in
terms of their apparent content. Final item selection within each category was
based on internal consistency, responses to each item being evaluated against
total score on the preliminary scale for the appropriate area.

As a final example of inventories relying primarily upon content validation
we may consider the California Test of Personality (78). Available in five
levels, this inventory undertakes to span the age range from kindergarten
to college students and unselected adults. In the type of scores obtained and
the proposed interpretations of such scores, the California Test of
Personality resembles empirically developed personality tests. In its
construction, however, content validation appears to have predominated.
Separate scores are found in 12 areas, identified by such labels as sense of
personal worth, withdrawing tendencies, social skills, and school relations.
From these part scores, a total adjustment score and two subtotals covering
personal and social adjustment are also computed. National norms are provided
for evaluating these 15 scores.

It should be added that with all these inventories some efforts have been 
made toward empirical validation of scores in each problem area. Few 
personality tests in use today rest their claims entirely on content validity. 
All tests cited in this section, however, have relied principally on content 
validity in the formulation, selection, and grouping of items. 


EMPIRICAL CRITERION KEYING 


Early Inventories. Empirical criterion keying refers to the development of a 
scoring key in terms of some external criterion. Such a procedure involves 
the selection of items to be retained and the assignment of scoring weights to 
each response. In the construction of the previously cited Woodworth Per- 
sonal Data Sheet, some of the statistical checks applied in the final selection 
of items pointed the way for criterion keying. Thus no item was retained 
in this inventory if 25 per cent or more of a normal sample answered it in 
the unfavorable direction. The rationale underlying this procedure was that 
a behavior characteristic that occurs with such frequency in an essentially 
normal sample cannot be indicative of abnormality. The method of con-
trasted groups was likewise employed in the selection of items. Only symp-
toms reported at least twice as often in a previously diagnosed psychoneu-
rotic group as in a normal group were retained.
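
The two statistical checks just described can be stated compactly. The
following is only a minimal sketch under stated assumptions, not a record of
Woodworth's actual computations: the function name retain_item, the item
labels, and the endorsement rates are all hypothetical, and the requirement
that the diagnosed group endorse the item at all is an added assumption. Only
the 25 per cent ceiling and the two-to-one ratio are taken from the text.

```python
def retain_item(p_normal, p_neurotic, ceiling=0.25, min_ratio=2.0):
    """Apply the two checks described in the text to a single item.

    p_normal   -- proportion of the normal sample giving the unfavorable answer
    p_neurotic -- proportion of the diagnosed group giving the unfavorable answer
    """
    # Check 1: drop any item that 25 per cent or more of normals endorse.
    if p_normal >= ceiling:
        return False
    # Check 2 (contrasted groups): the symptom must be reported at least
    # twice as often in the diagnosed group as in the normal group.
    # Written as a product so that p_normal == 0 needs no division.
    # Assumption: an item nobody in the diagnosed group endorses is dropped.
    return p_neurotic > 0 and p_neurotic >= min_ratio * p_normal


# Hypothetical endorsement rates, invented purely for illustration.
items = {
    "item_a": (0.30, 0.70),  # too common among normals
    "item_b": (0.10, 0.15),  # fails the two-to-one contrast
    "item_c": (0.05, 0.40),  # passes both checks
}
kept = [name for name, (p_n, p_d) in items.items() if retain_item(p_n, p_d)]
print(kept)
```

Note that both checks use the normal sample: the first sets an absolute
ceiling, the second a relative contrast with the diagnosed group.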

Another early example of criterion keying is provided by the Allport A-S 
Reaction Study (2, 4, 70). Described as a measure of ascendance-submis- 
sion (A-S), this inventory seeks to assess the individual’s tendency to domi- 
nate his associates or be dominated by them in face-to-face contacts of 
everyday life. Each item begins with a brief description of a situation that 
might commonly be encountered at a meeting, in school, on a bus, in a 
repair shop, or in other familiar settings. Two or four alternative ways of 
meeting the situation are listed, the subject being instructed to indicate which 
alternative most nearly represents his usual reaction. The responses vary in 
the degree of ascendance or submission they represent and are weighted ac- 
cordingly in the scoring. 

The scoring weights for the A-S Reaction Study were empirically established
on the basis of the criterion ratings obtained by those subjects in the
standardization sample who chose each response. Each subject's criterion
rating represented a mean of five ratings for social dominance, including a
self-rating and four ratings by associates. Following publication of the test,
considerable evidence for the validity of total scores has been accumulated,
chiefly by the method of contrasted groups. In addition to enjoying wide
popularity in its own right, the A-S Reaction Study has influenced the
development of many other inventories. This test is one of the most durable of
the early personality inventories. It might also be noted that dominance
has proved to be one of the most frequently identified and clearly
established traits in subsequent factorial analyses of personality.
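
The logic of empirically established response weights may be illustrated as
follows. This is a sketch under stated assumptions, not the published scoring
procedure: the weight of each alternative is taken here to be simply the mean
criterion rating of the subjects who chose it, whereas the text does not say
how the published weights were scaled or rounded. The function name, the
subject labels, and all ratings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def response_weights(choices, criterion):
    """Mean criterion rating of the subjects who chose each alternative.

    choices   -- {subject: alternative chosen on one item}
    criterion -- {subject: criterion rating for that subject}
    """
    by_alternative = defaultdict(list)
    for subject, alternative in choices.items():
        by_alternative[alternative].append(criterion[subject])
    return {alt: mean(vals) for alt, vals in by_alternative.items()}


# Hypothetical dominance ratings: one self-rating plus four ratings by
# associates for each subject, averaged into a single criterion rating.
five_ratings = {
    "s1": (5, 4, 4, 5, 4),
    "s2": (4, 4, 3, 4, 5),
    "s3": (2, 1, 2, 2, 3),
    "s4": (1, 2, 2, 1, 2),
}
criterion = {s: mean(r) for s, r in five_ratings.items()}

# Hypothetical choices on a single two-alternative item.
choices = {"s1": "a", "s2": "a", "s3": "b", "s4": "b"}
weights = response_weights(choices, criterion)
print(weights)
```

An alternative chosen chiefly by subjects rated as dominant thus receives a
high weight, whatever its apparent content.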

An examination of questionnaires designed to measure different aspects of
personality and bearing dissimilar trait labels reveals many common items.
It was this observation that led to the development of the Bernreuter
Personality Inventory (16). The 125 Yes-No-? items constituting this inventory
were based on questions chosen from four previously existing inventories.¹
Four scoring keys were developed for use with the Bernreuter inventory,
each response being assigned a separate weight on each of the four keys. The
resulting four scores were described as: B1-N (neuroticism), B2-S
(self-sufficiency), B3-I (introversion), and B4-D (dominance). The
correlations between these scores and the four separate tests from which the
Bernreuter was derived ranged up to .94. It thus appeared that a single, short
inventory could provide the same information that had previously required
four different inventories. This time-saving feature of the Bernreuter
probably accounted for much of its popularity.

¹ Thurstone Personality Schedule (modeled after the earlier Woodworth P.D.
Sheet), Laird's Introversion-Extroversion Test, Allport A-S Reaction Study,
and Bernreuter's Self-Sufficiency Scale.

An analysis of the intercorrelations among the four Bernreuter scores
clearly indicates, however, that these scores do not measure four independent
aspects of personality. The neuroticism and introversion scores, for example,
correlate .95 with each other. Part of this correlation is undoubtedly
attributable to the overlap of specific factors and of chance errors,
resulting from the use of common items in obtaining the two scores. To a large
extent, however, such intercorrelations reflect the overlap that exists among
categories commonly used in describing personality. Most of the traditional
self-report inventories utilized a priori trait differentiations that were not
always borne out by empirical findings.

An early factor analysis of the four Bernreuter scores, conducted by Flanagan
(38), demonstrated that two independent measures could be derived from the
inventory. These were designated F1-C (Confidence) and F2-S (Sociability).
Scoring keys for these two traits were subsequently added to the test, making
a total of six available keys. There is, of course, no justification for using
all six keys, since such a practice only further compounds the overlap. The
two Flanagan keys should be regarded as substitutes for the original four.
Other factor analyses, using either score or item intercorrelations, have in
general confirmed Flanagan's results (cf., e.g., 11). Most studies have found
two relatively independent traits which seem to correspond to those identified
by Flanagan, although these traits have been described in different terms.
Because they correspond more closely to familiar categories, however, the four
original keys have been applied more frequently than the two uncorrelated
keys. Moreover, the use of correlated factors (oblique axes) is now commonly
accepted in factorial research, especially in the area of personality.

The Minnesota Multiphasic Personality Inventory. The outstanding example 
of criterion keying in personality test construction is to be found in the Min- 
nesota Multiphasic Personality Inventory, commonly known as the MMPI 
(29, 31, 50, 51, 83). Partly because of its clinical origins and partly because 
of certain technical innovations, the application of this inventory has at- 
tained unprecedented proportions. Since its appearance in 1940 and the pub- 
lication of its first official manual in 1943, the MMPI has also stimulated 
a flood of research. In a survey published in 1956, Welsh and Dahlstrom 
(83) brought together 66 of the most important articles about the MMPI 
appearing between 1940 and 1954 and included a bibliography of 689 
titles. In 1959, The Fifth Mental Measurements Yearbook listed 719 ref- 
erences, and the 1960 Handbook by Dahlstrom and Welsh (29) contains 
over 1200. 

The MMPI was originally developed “to assay those traits that are com- 
monly characteristic of disabling psychological abnormality" (50). The in- 
ventory consists of 550 affirmative statements, which the subject is asked 
to classify into three categories: True, False, and Cannot say. In the in- 
dividual form of the test, the statements are printed on separate cards, 
which the subject sorts into three stacks. Later, a group form was pre- 
pared, in which the statements are printed in a test booklet and the re- 
sponses are recorded by the subject on an answer sheet. Both forms were 
designed for adults from about 16 years of age upward, although they have 
also been employed successfully with somewhat younger adolescents (52). 
Use of the individual form is generally recommended, especially when testing
disturbed patients or persons of low educational or intellectual level. The
MMPI items range widely in content, covering such areas as: health,
psychosomatic symptoms, neurological disorders, and motor disturbances;
sexual, religious, political, and social attitudes; educational, occupational,
family, and marital questions; and many well-known neurotic or psychotic
behavior manifestations, such as obsessive and compulsive states, delusions,
hallucinations, ideas of reference, phobias, sadistic and masochistic trends,
and the like. A few illustrative items are shown below:


I do not tire quickly.
Most people will use somewhat unfair means to gain profit or an advantage
rather than to lose it.
I am worried about sex matters.
When I get bored I like to stir up some excitement.
I believe I am being plotted against.


As first published, the MMPI provided scores on nine "clinical scales."
Each of these scales consists of items that differentiated between a specified
clinical group and a normal control group of approximately 700 persons. The
latter were all visitors at the University of Minnesota hospitals, and
represented a fairly adequate cross section of the Minnesota population of
both sexes between the ages of 16 and 55. The scales were thus developed
empirically by criterion keying of items, the criterion being traditional
psychiatric diagnosis. By this method, the following scales were prepared:


1. Hs: Hypochondriasis 

2. D: Depression 

3. Hy: Hysteria 

4. Pd: Psychopathic deviate 
5. Mf: Masculinity-femininity 
6. Pa: Paranoia 

7. Pt: Psychasthenia 

8. Sc: Schizophrenia 

9. Ma: Hypomania 


Items in the masculinity-femininity scale were selected in terms of frequency
of response by men and women. High scores on this scale indicate a
predominance of interests typical of the opposite sex. Such scores have been
found among homosexuals, especially among males, although in individual
cases high scores may have other interpretations.

A special feature of the MMPI is its utilization of four so-called validity
scales. These scales are not concerned with validity in the technical sense.
In effect, they represent checks on carelessness, misunderstanding,
malingering, and the operation of special response sets and test-taking
attitudes. The validating scores include:

Question Score (?): the total number of items put into the Cannot say category.


Lie Score (L): based upon a group of items that make the subject appear in
a favorable light, but are unlikely to be truthfully answered in the favorable
direction. (E.g., I do not like everyone I know.)


Validity Score (F): determined from a set of items very infrequently answered
in the scored direction by the standardization group. Although representing un-
desirable behavior, these items do not cohere in any pattern of abnormality.
Hence it is unlikely that any one subject actually shows all or most of these
symptoms. A high F score may indicate scoring errors, carelessness in respond-
ing, gross eccentricity, or deliberate malingering.


Correction Score (K): utilizing still another combination of specially chosen 
items, this score provides a measure of test-taking attitude, related to both L 
and F, but believed to be more subtle. A high K score may indicate defensiveness 
or an attempt to "fake good." A low K score may represent excessive frankness 
and self-criticism or a deliberate attempt to "fake bad." 


The first three scores (?, L, F) are ordinarily used for an over-all evalua- 
tion of the test record. If any of these scores exceeds a certain maximum 
value, the record is considered invalid. The K score, on the other hand, was 
designed to function as a "suppressor variable." It is employed to compute 
a correction factor which is added to the scores on some of the clinical scales
in order to obtain adjusted totals. It should be noted that the utilization of 
the various validity scales is not completely standardized, but is left partly 
to the judgment of the clinician. Moreover, the validity scales are constantly 
undergoing redefinition and revision, in the light of new research.

Since the publication of the MMPI in its initial form, about 200 new
scales have been developed, most of them by independent investigators who
had not participated in the construction of the original test (29). A Social
Introversion (Si) scale has been added to the original nine clinical scales
and is now routinely included in the MMPI profiles, with the code number
"0." In studies of high school and college students, the Si scale was
found to be significantly related to number of extracurricular activities in
which the student participated (83). Other scoring scales are applied as the
occasion demands. The available scales vary widely in the nature and
breadth of the criteria against which items were evaluated. Several scales
were developed within normal populations to assess personality traits
unrelated to pathology. Some scales have subsequently been applied to the test
records of the original MMPI normal standardization sample, thus providing
normative data comparable to those of the initial clinical scales (49).
Examples of these new scales include Ego Strength (Es), Dependency (Dy),
Dominance (Do), Prejudice (Pr), and Social Status (St). Other scales have
been developed for highly specific purposes and are more limited in scope.

In regular administration, the MMPI now yields 14 scores, including
the 9 original clinical scales, the Si scale, and the 4 validating scales.
Norms on the original control sample of approximately 700 persons are reported
in the form of "T scores," or standard scores with a mean of 50 and an SD of
10. These standard scores are used in plotting profiles, as illustrated in
Figures 108 and 109. Any score of 70 or higher, falling two standard
deviations or more above the mean, is generally taken as the cutoff point for
the identification of pathological deviations. It should be noted, however,
that the clinical significance of the same score may differ from one scale to
another. An identical score on the Hypochondriasis scale and on another
clinical scale, for example, may not indicate the same severity of
abnormality.
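
The arithmetic behind the T score and its cutoff can be shown briefly. The
sketch below assumes a simple linear conversion from raw score to standard
score; the function names, the normative mean and SD, and the raw scores are
hypothetical and are not taken from the MMPI manual. Only the mean of 50, the
SD of 10, and the cutoff of 70 come from the text.

```python
def t_score(raw, norm_mean, norm_sd):
    """Linear T score: mean 50 and SD 10 in the normative sample."""
    return 50 + 10 * (raw - norm_mean) / norm_sd


def is_elevated(t, cutoff=70):
    """True when a T score reaches the conventional cutoff of 70."""
    return t >= cutoff


# Hypothetical norms for one scale: raw-score mean 12, SD 4.
norm_mean, norm_sd = 12, 4

for raw in (12, 16, 20):
    t = t_score(raw, norm_mean, norm_sd)
    print(raw, t, is_elevated(t))
```

A raw score two standard deviations above the normative mean (20 in this
example) converts to a T score of exactly 70, which is why the cutoff of 70
and the "two standard deviations" criterion are equivalent statements.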

Less is known about low scores on the MMPI profile. Some clinicians
believe that the occurrence of scores substantially below 50 may also have
diagnostic significance, but no systematic interpretations have been developed
as yet. It seems likely, however, that marked deviations in either direction
may signify maladjustment, rather than indicating superior adjustment.
Test-taking attitudes may likewise contribute to the production of low scores.
An illustration of possible personality difficulties that may be associated
with low scores is provided by a study of college men with low Pa scores. In
comparison with a control group whose scores fell within the normal range, the
low Pa scorers showed more academic difficulty, a greater proportion of
underachievers, and a higher incidence of difficulty with [...]. The
investigator suggested interpretations of these findings in terms of
"repressed or denied hostility" [...].
There is considerable evidence to suggest that, in general, the greater the
number and magnitude of deviant scores on the MMPI, the more likely it is
that the individual is severely disturbed. For screening purposes, however,
shorter and simpler instruments are available. It is clear that the principal
applications of the MMPI are to be found in differential diagnosis. In using
the inventory for this purpose, the procedure is much more complex than
the labels originally assigned to the scales might suggest. The test manual
and related publications now caution against literal interpretation of the
clinical scales. For example, we cannot assume that a high score on the
Schizophrenia scale indicates the presence of schizophrenia. Other psychotic
groups show high elevation on this scale, and schizophrenics often score high
on other scales. Moreover, such a score may occur in a normal person. It is
partly to prevent possible misinterpretations of scores on single scales that
the code numbers 0 to 9 have been substituted for the scale names in later
publications on the MMPI.

The original clinical scales of the MMPI were based on a traditional psy-
chiatric classification which, though popular, rests upon a dubious theoretical
foundation. The artificiality of the categories employed in such traditional 
schemas has been a matter of concern in abnormal psychology for a long 
time. The fact that such categories prove unsatisfactory in actual practice is 
now generally conceded. Another difficulty is that a high score on any one
scale may have different implications depending upon the accompanying 
scores on other scales. In other words, it is the score pattern or profile rather 
than individual scale scores that should be examined. To facilitate the inter- 
pretation of such score patterns, a system of numerical profile coding has 
been developed. In such codes, the sequence and arrangement of scale num- 
bers show at a glance which are the high and low points in the individual's 
profile. For instance, the code 49-2 shows high scores in scales 4 (Pd) and 
9 (Ma) in decreasing order of size, and a low score in scale 2 (D). 
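
The idea of a numerical profile code can be sketched in a few lines. This is
a deliberately simplified illustration, not any of the published coding
systems: the cutoffs of 70 and 40, the ordering of low points from lowest
upward, and the T scores themselves are assumptions made only to reproduce
the 49-2 example above. The scale numbers follow the numbered list of clinical
scales, with 0 for the Si scale.

```python
# Scale numbers as given in the text: 1-9 for the clinical scales, 0 for Si.
SCALE_NUMBERS = {"Hs": 1, "D": 2, "Hy": 3, "Pd": 4, "Mf": 5,
                 "Pa": 6, "Pt": 7, "Sc": 8, "Ma": 9, "Si": 0}


def profile_code(t_scores, high_cut=70, low_cut=40):
    """Return a simplified code: high points, a dash, then low points.

    high_cut and low_cut are illustrative assumptions, not published rules.
    """
    # High points, largest T score first.
    highs = sorted((s for s in t_scores if t_scores[s] >= high_cut),
                   key=lambda s: t_scores[s], reverse=True)
    # Low points, smallest T score first.
    lows = sorted((s for s in t_scores if t_scores[s] <= low_cut),
                  key=lambda s: t_scores[s])
    code = "".join(str(SCALE_NUMBERS[s]) for s in highs)
    if lows:
        code += "-" + "".join(str(SCALE_NUMBERS[s]) for s in lows)
    return code


# Hypothetical T scores chosen to reproduce the example in the text.
example = {"Hs": 52, "D": 38, "Hy": 55, "Pd": 78, "Mf": 50,
           "Pa": 58, "Pt": 54, "Sc": 60, "Ma": 72, "Si": 48}
print(profile_code(example))
```

Because the code records only the order of the high and low points, two
profiles with quite different absolute elevations may share the same code.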


[Figure 108: profile chart of T scores, with the scales Hs, D, Hy, Pd, Mf,
Pa, Pt, Sc, and Ma along the horizontal axis.]


Fig. 108. MMPI Profiles of a Normal Adult and a Psychotic. (Adapted from Weider,
82, pp. 554 and 563, and reproduced by permission of The Ronald Press.)


An Atlas for the Clinical Use of the MMPI (51) provides coded profiles
and short case histories of 968 patients, arranged according to similarity of
profile pattern. This material is offered as an aid in understanding the
diagnostic significance of each profile. A similar codebook utilizing data
from over four thousand college students examined in a college counseling
center has been prepared for use by counselors (31). Current validation of the
MMPI is proceeding by the accumulation of empirical data on persons
who show each profile pattern or code. By such a process, the validity of
each MMPI code is gradually built up. The MMPI Handbook compiled by
Dahlstrom and Welsh (29) contains the most comprehensive survey of available
interpretive data on major profile patterns.

Examples of the kind of interpretations that are thus evolving include the
"neurotic triad," which consists of scales 1, 2, and 3. Codes with high points
in these three scales have been found to characterize neurotics. More
specifically, "conversion V" codes, beginning with 13 or 31, occur mainly in
"immature persons who dodge issues rather than meeting them . . ."
(32, p. 23). In this profile, the maximum elevation is still in the first
three scales, but scale 3 is higher than 2, thus producing a V-shaped
appearance in the first part of the profile. Psychosomatic and other physical
complaints are typical of these profiles. Many psychotic profiles show high
points in scales 6, 7, 8, and 9, as well as in the scales of the neurotic
triad (cf. Fig. 108). Such a pattern has been designated as the "psychotic
tetrad" (29, p. 96).

A series of studies on juvenile delinquents (52), involving both concurrent
and predictive validation procedures, showed 49 codes to be characteristic
of delinquency. The typical elevation in code 4 is illustrated in the mean
profile of a group of delinquent girls reproduced in Figure 109. Follow-up
studies of unselected ninth grade pupils showed that those with 49 profiles
were somewhat more likely to become delinquent. Delinquents with these
profiles also tended to respond less favorably to rehabilitation. In terms of
test responses, such a profile indicates rebelliousness, conflict with family
and society, non-conformity, wide and changeable interests, and [...].
Certain scales, such as 2 (D) and 0 (Si), seem to function as inhibitors in
delinquency codes, high scores in these scales reducing the probability of
delinquent behavior. A survey of an adult prison sample likewise yielded a
predominance of profiles with 4 high and 0 low (65).

Although the misleading psychiatric labels are dropped, it should be
noted that MMPI items are still grouped into scales on the basis of such
obsolescent categories. Factorial analyses based on intercorrelations of items
and of scales indicate that items would be differently grouped on the basis
of their empirically established interrelations (27; 62; 83, pp. 255-281).
However, an extensive body of normative data and clinical experience
pertaining to the old scales has accumulated over the years. In order not
to lose this store of information, later efforts have been directed toward the
re-interpretation of the old scales in terms of empirically derived profile
codes.

Fig. 109. Mean MMPI Profiles of a Group of Delinquent Girls (N = 99) and a
Group of Non-Delinquent Girls (N = 85). (From Hathaway and Monachesi, 52, p. 32;
reproduced by permission of the University of Minnesota Press.)

A closely related limitation of the MMPI stems from inadequate reliabilities
of some of the scales. The effectiveness of any profile analysis is
weakened by chance errors in the scores on which it is based. If individual 
scale scores are unreliable and highly intercorrelated, many of the inter- 
score differences that determine the profile code may have resulted from 
chance. Retest reliabilities on normal and abnormal adult samples reported 
in the manual range from the .50's to the low .90's. The intervals between 
retests varied from a few days to over a year. Retest coefficients found in a 
group of college students, however, were generally lower, although only a 
one-week interval had elapsed between testings (41). The mean of these
reliabilities was only .61. Six of the nine coefficients fell below .70, and two
fell below .40. Moreover, split-half reliabilities computed for the same col- 
lege sample showed an even wider variation from scale to scale, ranging 
from -.05 for scale 6 (Pa) to .81 for scale 7 (Pt).
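
The effect of low reliability and high intercorrelation on inter-score
differences can be made concrete with the familiar formula for the reliability
of a difference between two scores expressed in the same standard-score units.
The sketch below is illustrative only: the reliability of .61 is the mean
retest coefficient cited above, while the intercorrelation of .40 and the
comparison value of .90 are hypothetical figures chosen for the example.

```python
def difference_reliability(r_a, r_b, r_ab):
    """Reliability of the difference between two standardized scores.

    r_a, r_b : reliabilities of the two scales
    r_ab     : correlation between the two scales
    """
    if r_ab >= 1:
        raise ValueError("intercorrelation must be below 1")
    return ((r_a + r_b) / 2 - r_ab) / (1 - r_ab)


# Two scales with the mean retest reliability reported for college students
# (.61) and a hypothetical intercorrelation of .40.
low = difference_reliability(0.61, 0.61, 0.40)

# The same intercorrelation with hypothetical reliabilities of .90.
high = difference_reliability(0.90, 0.90, 0.40)

print(round(low, 2), round(high, 2))
```

Under these assumptions, only about a third of the variance in the difference
between the two less reliable scales would be dependable, as against more than
four-fifths for the more reliable pair.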

Still another limitation of the MMPI pertains to the size and representa- 
tiveness of the normative sample. The standard scores from which all pro- 
file codes are derived are expressed in terms of the performance of the con- 
trol group of approximately 700 Minneapolis adults tested in the original 
standardization. Such a normative sample appears quite inadequate when 
compared, for example, with the nationwide standardization samples em- 
ployed with many of the ability tests discussed in Parts 2 and 3. That the 
norms may vary appreciably in different normal populations is illustrated 
by the finding that the means obtained by college students are consistently 
above 50 on some of the scales (26; 41; 83, pp. 574-578). In one study of
600 college students, 39 per cent received scores above 70 on [...]
scales (41). The most satisfactory solution is [...]
gathering of specific norms, at least with reference to the incidence [...].

[...] ability tests, personality tests can be [...]
large subcultural as well as cultural differences. [...]
studies conducted in other countries reveal significant [...]
scales when profiles are based on the original [...] ([...]
76). Any explanation of such cultural and [...]
specific knowledge of cultural conditions and other [...]
within each group. Cultural differentials may [...]
(cf. 8, pp. 567-569, 596-598). Group differences in [...],
for example, reflect nothing more than differences in [...]
individual items or of instructions. High elevation in some groups could [...]
strong traditions of self-depreciation and [...]
type of behavior considered socially desirable may [...].
In still other groups, high scores may indicate the [...]
emotional problems arising from child-rearing practices, conflicts of social
roles, minority [...] frustrations, and the like.
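
The dependence of standard scores on the normative sample can be shown with a
brief sketch. MMPI scores are expressed as T scores, with a mean of 50 and a
standard deviation of 10 in the normative group. The normative mean and
standard deviation and the raw scores used below are hypothetical values
chosen only to show how a group that differs from the normative sample will
average above 50.

```python
from statistics import mean


def t_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a T score (mean 50, SD 10 in the norm group)."""
    return 50 + 10 * (raw - norm_mean) / norm_sd


# Hypothetical normative values for one scale.
norm_mean, norm_sd = 20.0, 5.0

# Hypothetical raw scores for five members of a different population.
college_raw = [22, 27, 31, 19, 26]
college_t = [t_score(x, norm_mean, norm_sd) for x in college_raw]

group_mean_t = mean(college_t)
n_above_70 = sum(t > 70 for t in college_t)

print(college_t, group_mean_t, n_above_70)
```

If the same raw scores were referred to norms gathered on the group itself,
the mean T score would by definition return to 50. This is the sense in which
specific norms correct for differences among normal populations.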

Inventories Derived from the MMPI. Besides stimulating a proliferation
of its own scoring scales, the MMPI has served as a basis for the development
of certain other inventories. Among these is the Taylor Manifest
Anxiety Scale (77). Although not standardized or published, this scale has
gained prominence both in research and in clinical practice. It was originally
constructed for use in experiments to test certain hypotheses regarding the
effects of drive on learning (74). For this purpose, five clinical psychologists
were asked to choose those MMPI statements that they regarded as overt
admissions of anxiety. The resulting scale consists of 50 items thus selected,
mixed with 175 buffer items drawn largely from the three validity scales
(L, F, K).


The item selection procedure followed by Taylor will be recognized as an
example of content validation. The clinical psychologists who chose the
anxiety items were functioning in the same way as educators who may be asked
to evaluate items for an achievement test in social studies. Internal
consistency indices were also utilized in the final selection of items. Some
evidence of concurrent validity of total scores was later obtained by comparing
the performance of neurotic and psychotic patients with that of normal groups.
In the course of the subsequent experiments in which this test was
administered, data pertaining to its construct validity were likewise
accumulated. This test, in fact, has often been cited as an example of construct
validation, probably because of the theoretical setting within which it
originated.

It is interesting to note that the administration of the 50 anxiety items in 
the context of the Manifest Anxiety Scale apparently leads to somewhat 
different responses than are obtained with the same items in the MMPI for- 
mat. In one study, the correlation between scores on the same set of items 
presented in these two settings was only .68. This finding highlights the fact 
that an individual's perception of a personality test item may be influenced 
by the other items among which it is embedded. Despite its popularity, the
Manifest Anxiety Scale should be regarded as a research instrument only,
until more data on its reliability, validity, and norms become available. The
original scale is suitable for college and unselected adult groups, but a form 
for children has subsequently been prepared (18). 

An effort to adapt the MMPI for use with normal high school students 
and college freshmen led to the development of the Minnesota Counseling 
Inventory (14). Many of the 355 true-false items of this inventory are re- 
worded MMPI items, and several of its scales have a close resemblance to 
MMPI scales. With norms based on over 20,000 high school students tested 
in ten states, this test provides scores in seven areas, designated as: Family 
Relationships, Social Relationships, Emotional Stability, Conformity, Adjust- 
ment to Reality, Mood, and Leadership. Although labeled in terms of the 
favorable end of the scale, “Conformity,” for example, bears a strong resem- 
blance to the MMPI Pd scale, “Adjustment to Reality" to the Sc scale. 
There are also two verification scores, similar to MMPI validity scales. Total 
scores on the different scales were validated by comparing random samples 
of students with groups nominated by teachers as outstanding examples of 
the trait in question. Split-half and retest reliabilities are satisfactory, but 
some of the scale intercorrelations are about as high as their reliability
coefficients. The seven scores are thus not as distinct as their titles imply.
Literal interpretation of some of the scales, moreover, would be misleading.
The inventory should be used only by counselors who are sufficiently familiar 
with its construction to evaluate scores properly. 

As a final example, mention may be made of the California Psychological 
Inventory (44), which has been described as “the sane man's MMPI.” 
Drawing about half of its items from the MMPI, this inventory provides 15 
scales derived largely by criterion keying, together with three verification 
scales. Items were selected in terms of such criteria as course grades, partici- 
pation in extracurricular activities, prominence as a leader, and ratings for 
various traits. Examples of the scales include dominance, sociability, sense of 
well-being, self-control, tolerance, achievement via conformance, achievement
via independence, and flexibility. Standard score norms on over six thousand
cases of each sex are provided. Data are reported on retest but not on
split-half reliability. High intercorrelations of scales indicate considerable
overlap and redundancy. Cross-validation of scale scores against appropriate
criteria has yielded rather low correlations. Major emphasis is placed on the
interpretation of composite profiles, but the procedures suggested for this
purpose are still quite subjective. In its present form, this inventory should
be used only by persons well trained in counseling or clinical psychology.


FACTOR ANALYSIS IN TEST DEVELOPMENT


In the effort to arrive at a more systematic classification of personality 
traits, a number of psychologists have turned to factor analysis. A series of 
studies by Guilford and his co-workers represents one of the pioneer ventures
in this direction (cf. 46, 48). Rather than correlating total scores on
existing inventories, these investigators computed the intercorrelations among
individual items from many personality inventories. As a by-product of this
research, three personality inventories were developed: Inventory of Factors
STDCR, Guilford-Martin Inventory of Factors GAMIN, and Guilford-Martin
Personnel Inventory. After combining two highly correlated factors
to avoid duplication and redefining other factors, a single 10-factor inventory
was prepared, known as the Guilford-Zimmerman Temperament Survey
(47). This inventory yields separate scores for the following traits, each
score based on 30 different items:


G. General Activity: Hurrying, liking for speed, liveliness, vitality,
production, efficiency vs. slow and deliberate, easily fatigued, inefficient.

R. Restraint: Serious-minded, deliberate, persistent vs. carefree, impulsive,
excitement-loving.

A. Ascendance: Self-defense, leadership, speaking in public, bluffing vs.
submissiveness, hesitation, avoiding conspicuousness.

S. Sociability: Having many friends, seeking social contacts and limelight vs.
few friends and shyness.

E. Emotional Stability: Evenness of moods, optimistic, composure vs.
fluctuation of moods, pessimism, daydreaming, excitability, feelings of guilt,
worry, loneliness, and ill health.

O. Objectivity: Thick-skinned vs. hypersensitive, self-centered, suspicious,
having ideas of reference, getting into trouble.

F. Friendliness: Toleration of hostile action, acceptance of domination, respect
for others vs. belligerence, hostility, resentment, desire to dominate, and
contempt for others.

T. Thoughtfulness: Reflective, observing of self and others, mental poise vs.
interest in overt activity and mental disconcertedness.

P. Personal Relations: Tolerance of people, faith in social institutions vs.
fault-finding, critical of institutions, suspicious, self-pitying.

M. Masculinity: Interest in masculine activities, not easily disgusted,
hard-boiled, inhibits emotional expression, little interest in clothes and
style vs. interest in feminine activities and vocations, easily disgusted,
fearful, romantic, emotionally expressive.


The items in the Guilford-Zimmerman Temperament Survey are expressed 
in the form of affirmative statements, rather than questions. Most concern 
the examinee directly. A few represent generalizations about other persons.
Three examples are given below:


You start work on a new project with a great deal of enthusiasm   YES ? NO

You are often in low spirits   YES ? NO

Most people use politeness to cover up what is really “cut-throat”
competition   YES ? NO


The affirmative item form was chosen in the effort to reduce the resistance 
that a series of direct questions is likely to arouse. In addition, three verifica- 
tion keys are provided to detect falsification and carelessness of response. 

Percentile and standard score norms were derived chiefly from college 
samples. Attention is called to the desirability of interpreting not only single- 
trait scores but also total profiles. For example, a high score in Emotional 
Stability is favorable if coupled with a high General Activity score, but may 
be unfavorable if it occurs in combination with a low General Activity 
score. In the latter case, the individual may be sluggish, phlegmatic, or lazy.
Split-half reliabilities of separate factor scores range from .75 to .85.
Higher reliabilities would of course be desirable for the differential
interpretation of individual profiles. Similarly, although an effort was made
to obtain
independent, uncorrelated trait categories, some of the intercorrelations 
among the 10 traits are still appreciable. Originally presented only on the 
basis of its factorial validity, the inventory has subsequently been employed 
in scattered studies of empirical validity, with varied results. 
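
The computation of a split-half reliability may be sketched as follows. The
sketch assumes an odd-even split of the items and the Spearman-Brown formula
for estimating the reliability of the full-length scale from the correlation
between halves; the item responses are hypothetical, and the function names
are introduced only for this illustration.

```python
from statistics import mean


def pearson(x, y):
    """Product-moment correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r_half, factor=2):
    """Estimated reliability of a test `factor` times as long as the half."""
    return factor * r_half / (1 + (factor - 1) * r_half)


def split_half_reliability(item_scores):
    """Odd-even split-half reliability, stepped up to full length."""
    odd = [sum(person[0::2]) for person in item_scores]
    even = [sum(person[1::2]) for person in item_scores]
    return spearman_brown(pearson(odd, even))


# Hypothetical keyed responses (1 = scored answer) of four persons to four items.
responses = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
print(split_half_reliability(responses))

# A correlation of .60 between halves corresponds to a full-length
# reliability of .75.
print(spearman_brown(0.60))
```

The same formula shows why short scales tend to yield low coefficients: with
fewer items the correlation between halves is lower to begin with, and there
is less gain from the step-up to full length.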


vincing. It should be recalled that an element of subjectivity is likely to enter
into the identification of factors, since the process depends upon an examina- 
tion of those measures or items having the highest loadings on each factor 
(cf. Ch. 13). Hence the cross-identification of factors from different in- 
vestigations using different measures is difficult. Despite the extensive re- 
search conducted by Cattell and his associates over a period of nearly twenty 
years, the traits proposed by Cattell must be regarded as tentative. 

On the basis of their factorial research, Cattell and his co-workers have 
constructed a number of personality inventories, of which the most compre- 
hensive is the Sixteen Personality Factor Questionnaire (24). Designed for 
ages 16 and over, this inventory yields 16 scores in such traits as aloof vs. 
warm, emotional vs. calm, submissive vs. dominant, glum vs. enthusiastic,
etc. In addition, a “motivational distortion” or verification key is provided
for one of the forms (C). Owing to the shortness of the scales, reliabilities of 
factor scores for any single form are generally low. Even when two forms 
are combined, several split-half coefficients fall below .80. Available informa- 
tion on normative samples and other aspects of test construction is inade- 
quate. Empirical validation data include average profiles for various occupa- 
tional groups and psychiatric syndromes. 

A similar inventory suitable for ages 12 to 18 has also been developed 
(25), and another for ages 8 to 12 (67). In addition, separate inventories 
have been published within more limited areas, including anxiety (21),
introversion-extroversion (22), and neuroticism (23). These areas correspond
to certain second-order factors identified among correlated first-order factors. 
All of these inventories are experimental instruments requiring further de- 
velopment, standardization, and validation. 

Factor analysis provides a technique for grouping personality inventory 
items into relatively homogeneous and independent clusters. Such a grouping 
should facilitate the investigation of validity against empirical criteria. It
should also permit a more effective combination of scores for the prediction
of specific criteria. Homogeneity and factorial purity are desirable goals in
test construction. But they are not substitutes for empirical validity. 


FORCED CHOICE AND THE SOCIAL 
DESIRABILITY VARIABLE 


The forced-choice technique was simultaneously developed by several 
psychologists working in industry or in the armed services during the decade 
of the 1940's (cf. 10, 55, 71, 73, 81). Essentially, it requires the subject to
choose between two descriptive terms or phrases that appear equally acceptable
but differ in validity. The paired phrases may both be desirable or
both undesirable. A tetrad form of item may also be employed, in which
two desirable and two undesirable phrases are included; sometimes a fifth
alternative is added, giving a pentad item form. In such cases, the subject
must indicate which phrase is most characteristic and which least
characteristic of himself. Another variant incorporates three phrases, equated
for desirability, from which the subject must choose the most and the least
applicable.

The construction of a forced-choice inventory requires two principal types 
of information regarding each descriptive phrase, viz., its social desirability
or “preference index” and its empirical validity or “discriminative index.”
The latter may be determined on the basis of any specific criterion the
inventory is designed to predict, such as academic achievement or success on a
particular kind of job. Social desirability can be found by having the items
rated for this variable by a representative group, or by ascertaining the
frequency with which the item is endorsed in self-descriptions. In a series of
studies on many different groups, Edwards (33) has shown that frequency
of choice and judged social desirability correlate between .80 and .90. In
other words, the average self-description of a population agrees closely with
its average description of a desirable personality.
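
The way in which the two indices enter into the assembly of forced-choice
pairs may be illustrated by a minimal sketch. The phrases, the index values,
and the two tolerance limits below are all hypothetical; actual inventories
have used a variety of procedures for matching and selecting alternatives.

```python
from itertools import combinations


def forced_choice_pairs(phrases, max_pref_gap=0.10, min_disc_gap=0.30):
    """Pair phrases that are close in preference index but far apart in
    discriminative index.

    phrases: list of (text, preference_index, discriminative_index)
    """
    pairs = []
    for a, b in combinations(phrases, 2):
        pref_gap = abs(a[1] - b[1])   # difference in social desirability
        disc_gap = abs(a[2] - b[2])   # difference in validity
        if pref_gap <= max_pref_gap and disc_gap >= min_disc_gap:
            pairs.append((a[0], b[0]))
    return pairs


# Hypothetical phrases with illustrative index values.
phrases = [
    ("Works steadily", 0.80, 0.45),
    ("Is well liked", 0.82, 0.05),
    ("Gives up easily", 0.15, -0.40),
    ("Is often late", 0.18, -0.02),
]

pairs = forced_choice_pairs(phrases)
print(pairs)
```

In this example the two favorable phrases form one pair and the two
unfavorable phrases another. Within each pair the alternatives look about
equally acceptable to the respondent, but only one of them is related to the
criterion.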

Such a relationship cannot be attributed primarily to the deliberate faking 
of personality test responses, since equally high correlations were found when
subjects filled out questionnaires anonymously. To some extent, the correlation
reflects a facade effect, or a response set to put up a good front, of which
the subject himself may not be fully aware. A more pervasive lack of insight
about one's own characteristics and actual self-deception may likewise be
involved. It is also likely, of course, that the social stereotype of desirable
characteristics in any one culture is affected by the prevalent behavior
patterns of its members and vice versa.

In order to investigate further the relation of the social desirability
variable (SD) to personality test responses, Edwards (33) developed a special
social desirability scale. Beginning with 150 heterogeneous MMPI items
taken largely from the three validating keys, Edwards selected 79 items
that yielded complete agreement among judges with regard to social
desirability ratings. Through item analyses against total scores in this
preliminary scale, he shortened the list to 39 items. This SD scale correlated
.81 with the K scale of the MMPI, partly because of common items between the
two scales. Individual scores on this scale can be correlated with scores on
any personality test as a check on the degree to which the social desirability
variable has been ruled out of test responses. Whatever the cause of the
relation, in so far as social desirability is correlated with test scores the
effectiveness of the test in discriminating individual differences in other
traits is reduced. To be sure, the individual's SD score may be useful in its
own right for diagnostic or predictive purposes. But the score should be
recognized as such, rather than being misinterpreted as a measure of some
other variable.

One way of reducing the contribution of the social desirability variable 
in personality test scores is through the use of the forced-choice technique. 
If the paired alternatives are truly equated in social desirability for the popu- 
lation in question, the opportunity for dissembling and faking is minimized. 
This is one of the principal appeals of the forced-choice item form, especially 
in industrial and military situations. At the same time, it cannot be assumed 
that the perceived desirability of items remains unchanged for all purposes. 
The relative desirability of the same items for salesmen or for physicians, for 
example, may differ from their desirability when judged in terms of general
cultural norms. Thus a forced-choice test whose items were equated in general
social desirability could still be faked when taken by job applicants,
candidates for admission to professional schools, and other specifically
oriented groups. It is possible, of course, to devise more subtle items which can
be equated in frequency of choice within specific groups. 

The forced-choice technique may also serve to reduce ambiguity and re- 
lated difficulties in the interpretation of items. For example, in the tradi- 
tional "Yes-No" type of item, many subjects experience difficulty in choosing 
a response alternative because their characteristic behavior appears to fall 
between the clear-cut “Yes” and “No” answers. Nor is the inclusion of a “?” 
category a very satisfactory solution, since such a response may indicate 
failure to understand the question, uncertainty, inapplicability of the item,
or any intermediate degree of the behavior under consideration. The use of 
terms designating amount, degree, or frequency is also open to ambiguity 
of interpretation. It has been demonstrated, for example, that subjects differ 
widely in the meanings which they attach to such terms as “often,”
“frequently,” “usually,” “rarely,” and the like (72). By requiring a relative
rather than an absolute judgment, the forced-choice technique reduces these 
sources of ambiguity. In this item form, the individual is called upon to state 
only which of two descriptions is more nearly applicable to himself. 

The forced-choice technique was utilized in the development of the Jurgensen
Classification Inventory (55, 56). Designed especially for industrial
application, this inventory provides no general norms or scoring key. Test
users are expected to develop scoring keys by empirical tryout against local
criteria. The manual describes the steps that should normally be followed
in such criterion keying of test items, and furnishes many helpful suggestions
and aids to facilitate the work.

Essentially this process involves the determination of a discriminative index
for each response and the assignment of scoring weights on this basis.
These weights can be positive, negative, or zero, depending upon whether
the given response occurs significantly more often in upper, lower, or neither
criterion group. For example, if successful salesmen choose a certain response
significantly more often than unsuccessful salesmen, that response would
receive a positive weight in the key. In so far as the personality
qualifications for different jobs vary widely, the practice of criterion keying
against specific job criteria would seem desirable. Many test users,
unfortunately, may lack the time or facilities for carrying out the necessary
research.
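
The assignment of scoring weights in criterion keying can be sketched in a few
lines. The sketch assumes unit weights of +1, -1, and 0 and a two-tailed test
of the difference between two proportions at the .05 level; these choices, the
function name, and the response counts are assumptions made for illustration
and are not taken from the manual of any particular inventory.

```python
from math import sqrt


def response_weight(upper_yes, upper_n, lower_yes, lower_n, z_crit=1.96):
    """Return +1, -1, or 0 for one response alternative.

    upper_yes / upper_n: number choosing the response in the upper criterion group
    lower_yes / lower_n: number choosing the response in the lower criterion group
    """
    p_upper = upper_yes / upper_n
    p_lower = lower_yes / lower_n
    pooled = (upper_yes + lower_yes) / (upper_n + lower_n)
    se = sqrt(pooled * (1 - pooled) * (1 / upper_n + 1 / lower_n))
    if se == 0:
        return 0  # response chosen by everyone or by no one
    z = (p_upper - p_lower) / se
    if z >= z_crit:
        return 1
    if z <= -z_crit:
        return -1
    return 0


# Hypothetical counts for three responses, with 100 successful and
# 100 unsuccessful salesmen in the two criterion groups.
key = [
    response_weight(70, 100, 40, 100),  # favored by the successful group
    response_weight(30, 100, 55, 100),  # favored by the unsuccessful group
    response_weight(50, 100, 52, 100),  # no dependable difference
]
print(key)
```

A key developed in this way should be cross-validated on a new sample, since
some of the differences found in the original groups will have arisen by
chance.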

The Jurgensen Classification Inventory was the first commercially avail- 
able inventory to employ the forced-choice technique. Its items pertain to the 
kinds of persons the subject regards as most and least irritating, the ways in 
which he would most prefer and least prefer to be considered by others,
personal likes and dislikes, preferred activities or modes of behavior, and the
types of persons the subject dislikes. Sample items from this inventory are
reproduced below, together with abridged instructions:


Mark the one which you think is MOST irritating and the one which you think
is LEAST irritating.
People who are
    Bluffers
    Complainers
    Interrupters

Decide which reputation you would MOST prefer to have, and which you would
LEAST prefer to have.
Considered
    [...]
    [...]
    Friendly

Mark the item you prefer, using XX if your preference is strong, and X if
it is weak.
    Have interesting work with moderate pay
    Have uninteresting work with high pay

Another example of forced-choice technique is provided by the Personal
Inventory (PI), developed by Shipley and his associates during World
War II. This instrument was employed as a psychiatric screening device
by the Navy. Each item consisted of a pair of statements, the subject being
required to check the alternative that better described him. The items were
paired so as to be approximately equated in social acceptance but sharply
contrasted in frequency of choice by emotionally maladjusted and normal 
subjects. Psychiatric diagnosis provided the criterion for the latter purpose. 

Two short inventories employing forced-choice items are the Gordon 
Personal Profile (42) and the Gordon Personal Inventory (43). The 
items in these inventories were chosen by means of factor analyses to meas- 
ure eight personality traits. The traits included in the Profile are desig- 
nated as Ascendancy, Responsibility, Emotional Stability, and Sociability; 
those included in the Inventory, as Cautiousness, Original Thinking, Per- 
sonal Relations, and Vigor. In addition, an over-all adjustment score may 
be computed from the Profile. In both tests, each item contains four state- 
ments, corresponding to the four traits being measured. The subject indicates 
which of the four statements is most like him, and which least like him. A 
typical item from the Gordon Personal Profile is given below: 


Able to make important decisions without help.
Does not mix easily with new people.
Inclined to be tense or high-strung.
Sees a job through despite difficulties.


Retest and internal-consistency reliabilities of the eight factor scores fall 
mostly in the .80's. Some appreciable intercorrelations among factor scores 
remain. Empirical validity data are promising but meager for the Profile 
and as yet unavailable for the more recently published Inventory. 

The most ambitious attempt to measure the effect of the social desirability 
variable in a personality inventory and to control its operation by means of
the forced-choice technique is to be found in the Edwards Personal Preference
Schedule (32). Since this test also illustrates the use of personality
theory in test construction, further discussion of it will be postponed until
the next section.


Mention should likewise be made of the application of the forced-choice 
technique to the construction of rating scales. This type of rating scale has
been developed for use by supervisors in the merit rating of industrial
employees, as well as in the evaluation of military officers (10, 73, 81).
Although useful in certain situations, the forced-choice technique has proved
less effective when applied to ratings than to self-report inventories. One
reason is that these ratings are assigned in very specific contexts. In
addition, the experienced supervisors who use such rating scales are better
able to “break the key” than is the case with most examinees (59, 81). Owing to
his greater job familiarity, the supervisor is more likely to recognize the
differential validity or job relevance of alternate items. This opens the way
for intentional or unintentional bias to affect the ratings. It is just the
operation of such rater bias that the forced-choice technique was designed to
minimize. There is empirical evidence, however, that supervisors can, when so
directed, alter the ratings on a forced-choice scale in either a favorable or
unfavorable direction. Moreover, the more able supervisors tend to be more
successful in thus “breaking the key.”

To be sure, the forced-choice technique may still reduce, although it does
not eliminate, the operation of rater bias. Moreover, it may be possible to
formulate more subtle items, whose face validity is approximately equated
even for experienced supervisors. The forced-choice feature simply raises
the level of sophistication needed to “break the key.” It does not eliminate
the possibility of so doing.


PERSONALITY THEORY IN TEST DEVELOPMENT 


Personality theories have usually originated in clinical settings. The
amount of experimental verification to which they have subsequently been
subjected varies tremendously from one theoretical system to another.
Regardless of the extent of such objective verification, a number of
personality tests have been constructed within the framework of one or another
personality theory. Clinically formulated hypotheses have been especially
prominent in the development of projective techniques, to be considered in
Chapter 20. Among the personality theories that have stimulated test
development, one of the most prolific has been the manifest need system
proposed by H. A. Murray and his associates at the Harvard Psychological
Clinic (64). The most comprehensive inventory designed to assess the
strength of such needs is the Edwards Personal Preference Schedule (32).

Beginning with 15 needs drawn from Murray's list, Edwards prepared
sets of items whose content appeared to fit each of these needs. When these
items were administered in traditional “Yes-No” form to a group of college
students, frequency of endorsement correlated .87 with the judged social
desirability (SD) of the items. As a result, Edwards adopted the forced-choice
format, placing in each pair items that were matched in SD. Several
independent experiments demonstrated that, when judged in terms of general
cultural norms, the SD of items remains remarkably stable in groups differing
in sex, age, education, socioeconomic level, or nationality (33). Consistent
results were also obtained when the judgments of hospitalized psychiatric
patients were compared with those of normal groups.

Edwards, however, did not recheck the SD scale values of his statements
when presented in pairs. Later research suggested that the SD values do 
change under these conditions (28). Not only are there significant differences 
in SD scale values of paired items, but a correlation of .88 was found between 
the redetermined scale values of paired items and their frequency of endorse- 
ment. It is also relevant to note that studies on faking indicate that scores on 
the EPPS can be deliberately altered to create more favorable impressions, 
especially for specific purposes (17, 30). The latter possibility, of course, 
exists in any forced-choice test in which items were equated in terms of 
general social norms only. On the whole, it appears that the social desirability 
variable was not as fully controlled in the EPPS as had been anticipated. 

Correlations of the 15 EPPS scores with the social desirability scale, however,
are lower than those of other inventories. Only two of the EPPS correlations
were significant (at the .05 level) and these were low (.32). With
such tests as the MMPI and the Guilford-Zimmerman, on the other hand, the
SD scale yielded a number of correlations between .50 and .80 (32, 33).


The 15 needs covered by the EPPS, together with abbreviated descriptions
of each, are as follows:


Achievement: To do one's best, to accomplish something very difficult or
significant.

Deference: To let others make decisions, to conform to what is expected of one.

Order: To have regular times and ways for doing things, to keep things neat
and well organized.

Exhibition: To be the center of attention, to say witty things or talk about
personal achievements.

Autonomy: To be independent of others in making decisions, to avoid
responsibilities and obligations.

Affiliation: To be loyal, to participate in friendly groups, to share or do
things with friends.

Intraception: To analyze one's motives and feelings, to observe and understand
the feelings of others.

Succorance: To receive help or affection from others, to have others be
sympathetic and understanding.

Dominance: To persuade and influence others, to supervise others, to be
regarded as a leader.

Abasement: To feel guilty when one has done wrong, to accept blame, to feel
timid or inferior.

Nurturance: To help friends or others in trouble, to forgive others, to be
generous with others.

Change: To do new and different things, to meet new people, to take up new
fads and fashions.

Endurance: To keep at a job until it is finished, to avoid being interrupted
while hard at work.

Heterosexuality: To go out with or be in love with one of the opposite sex, to
tell or listen to sex jokes.

Aggression: To attack contrary points of view, to become angry, to make fun of
others or tell them off.


The inventory consists of 210 different pairs of forced-choice statements,
in which items from each of the 15 scales are paired off twice against items
from the other 14. In addition, 15 pairs are repeated in identical form to
provide an index of respondent consistency. A profile stability score can also
be found by correlating the individual's odd and even scores in the 15
variables. Both percentile and T-score norms are provided for college men
(N = 749) and college women (N = 760), based on students in 29 colleges
scattered over the country. Additional percentile norms are provided from a
general adult sample, including 4031 men and 4932 women. Drawn from
urban and rural areas in 48 states, these respondents constituted a
consumer-purchase panel used for market surveys (58). The need for specific
group norms on personality tests is highlighted by the large and significant
mean differences found between this consumer panel and the college sample. The
normal percentile chart used in plotting individual scores is illustrated in
Figure 110.
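The item count follows directly from the pairing scheme. The short Python sketch below is only an arithmetic check on the figures quoted above; the variable names are illustrative and are not taken from the test manual.

```python
from itertools import combinations

N_SCALES = 15              # need scales, as stated in the text
TIMES_EACH_PAIR_USED = 2   # every pair of scales is matched twice
N_REPEATED_PAIRS = 15      # pairs repeated in identical form (consistency index)

# Every unordered pair of distinct scales: C(15, 2).
scale_pairings = list(combinations(range(N_SCALES), 2))

distinct_items = len(scale_pairings) * TIMES_EACH_PAIR_USED
total_items = distinct_items + N_REPEATED_PAIRS

# Number of distinct items in which a statement from any one scale appears.
items_per_scale = (N_SCALES - 1) * TIMES_EACH_PAIR_USED

print(len(scale_pairings), distinct_items, total_items, items_per_scale)
# 105 210 225 28
```

On this reckoning a statement from any single scale appears in 28 of the 210 distinct pairs, which sets the ceiling for that scale's raw score if the repeated pairs are left out of account (an assumption made here for simplicity).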

Retest reliabilities of the 15 scales range from .74 to .88; split-half, from
.60 to .87. Score intercorrelations are satisfactorily low, the highest being
[...] and many being close to zero. It might be noted parenthetically that
many of the intercorrelations are negative—a necessary result of the
forced-choice technique. On such a test, it is impossible for an individual to
receive a high score or low score consistently on all variables. What the
profile shows is the relative strength of the different needs.
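Why forced-choice scores cannot all be high, and why negative intercorrelations are built in, can be seen in a small simulation. The sketch below is an illustration under stated assumptions, not a model of the actual EPPS items: each of 200 hypothetical respondents chooses at random within every pair, and the 15 repeated pairs are ignored.

```python
import random
from itertools import combinations

N_SCALES = 15
TIMES_EACH_PAIR_USED = 2

def simulate_profile(rng):
    """Score one hypothetical respondent: one point per pair to the chosen scale."""
    scores = [0] * N_SCALES
    for a, b in combinations(range(N_SCALES), 2):
        for _ in range(TIMES_EACH_PAIR_USED):
            scores[rng.choice((a, b))] += 1
    return scores

def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(0)
profiles = [simulate_profile(rng) for _ in range(200)]

# Every profile adds up to the number of pairs, whatever the choices were.
totals = {sum(p) for p in profiles}
print(totals)  # {210}

# Because the total is constant, the variances and covariances must cancel:
# sum of variances + 2 * sum of pairwise covariances = 0.
columns = list(zip(*profiles))
var_sum = sum(covariance(c, c) for c in columns)
cov_sum = sum(covariance(columns[a], columns[b])
              for a, b in combinations(range(N_SCALES), 2))
print(abs(var_sum + 2 * cov_sum) < 1e-6)  # True
```

Since every profile totals 210, the mean of the 15 scores is fixed at 14, and a gain on one scale must be paid for on another. The covariances among scales therefore sum to a negative quantity regardless of what the respondents are like.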
The validity data reported in the manual are so meager and tangential
as to be virtually negligible. Since the publication of the test, however, a
number of independent studies have contributed information toward the
construct validation of several scales. In one such study (15), subjects were


Fig. 110. Profile on the Edwards Personal Preference Schedule. (Reproduced by
permission of The Psychological Corporation.)


put through three experimental task situations requiring the explicit
demonstration of dependent or independent behavior. Subjects who had scored
high on deference and low on autonomy showed more reliance on others for
approval and for help. No relationship was found, however, between these
scale scores and conformity to the opinions and demands of others. Also
relevant to the construct validity of the various scales are a number of
significant group differences in mean scores with reference to age, sex,
education, socioeconomic level, and other demographic variables (57, 58).

Considerably more information is required for the confident interpretation
of EPPS profiles in counseling, selection, and other practical applications.
Further work on the control of the social desirability variable, as well as
norms on other groups, would also seem desirable. In its present stage, the
EPPS is a highly promising research instrument which has contributed several
ingenious innovations in test construction.


EVALUATION OF PERSONALITY INVENTORIES


It should now be apparent that the construction and use of personality
inventories are beset with special difficulties over and above the common
problems encountered in all psychological testing. The question of faking
and malingering is far more acute in personality measurement than in aptitude
testing. The behavior measured by personality tests is also more changeable
than that measured by tests of ability. The latter fact complicates the
determination of test reliability, since the stability of the test is likely
to become confused with broad, systematic behavioral changes (cf. Ch. 5). Even
over relatively short intervals, it cannot be assumed that variations in test
response are restricted to the test itself and do not characterize the area of
non-test behavior under consideration.

Another problem is presented by the greater specificity of responses in
the sphere of personality. For example, an individual might be quite sociable
and extroverted at the office, but rather shy and introverted at formal social
receptions. Or a student who cheats on examinations might be scrupulously
honest in money matters. Such specificity is in turn related to the difficulty
of grouping items into clearly defined categories or "personality traits."
There is certainly less agreement among the different schemas of
classification proposed in the personality than in the aptitude area.

Finally, the search for adequate criterion data for the establishment of
validity has generally proved less successful in personality tests. For this
reason, personality test constructors have sometimes resorted to such
makeshifts as correlations with other tests, internal consistency, or content
validity.

To a large extent, the problems cited above are shared by all types of
personality tests. But for the present we shall limit our discussion to
self-report inventories. The acknowledged deficiencies of current personality
inventories may be met in at least two major ways. First, personality
inventories may be recognized as intrinsically crude instruments and their
application restricted accordingly. Second, the inventories themselves may be
improved by direct attack upon the major sources of difficulty. Most
psychologists would probably accept some combination of the two approaches,
although a few may align themselves exclusively behind one or the other
(cf., e.g., 36, 37, 45, 60, ...).

A simple illustration of the first approach is provided by the use of a
personality inventory merely as a springboard for a clinical interview. In such
a case, the interviewer might not even score the inventory in the standard
manner, but might simply examine the subject's answers with a view to
identifying problem areas for further probing during the interview. Other
current practices stemming from a recognition of the pitfalls presented by
personality inventories pertain to the interpretation of “poor” versus “good”
scores, and the use of inventories in counseling versus selection. In most
situations a “poor,” or unfavorably deviant, score is likely to signify
maladjustment, while a “good” score may be ambiguous. It is also evident that
the motivation to create a favorable impression is much stronger in the job
applicant than in the person seeking help from a counselor or in the subject
of a research project (53). Even in the latter situations, however, complete
candor cannot be assumed, because of the prevalence of rationalizations,
defensive reactions, and other façade effects.

Attempts to improve self-report inventories by direct attack upon the 
major sources of difficulty have been described throughout this chapter. 
Among such efforts may be mentioned the application of factor analysis as a 
means of arriving at more systematic trait categories, the keying of in- 
dividual items against highly specific criteria, the use of a forced-choice 
technique, the development of verification and correction scales, and the 
preparation of “subtle” items whose diagnostic significance is less apparent 
to the respondent. Such items, for example, may present rationalizations 
that have been found to be indicative of certain more basic personality 
traits. It can be readily seen that each of these procedures is directed to- 
ward one or more of the special difficulties outlined above. 

Personality inventories may also be evaluated at a more basic level, in 
terms of their theoretical assumptions and underlying rationale. A whole 
volume could easily be devoted to this discussion. For the present purpose, 
however, a consideration of two frequently recurring and related questions 
will suffice. The first concerns the type of information that personality in- 
ventories are designed to elicit. The second pertains to the inherent ambiguity 
of inventory responses. 

Because the early personality inventories were designed as a rapid
substitute for the psychiatric interview, it is frequently assumed that the
response to each question is an index of the presence or absence of the
specific symptom or other behavior characteristic described by the question. In
the light of the usual procedures for selecting test items and validating the
inventories, however, such an assumption appears unwarranted. As in any
psychological test, the responses should be operationally interpreted in terms
of the criteria against which validity was established (35).


The distinction between these alternative ways of interpreting inventory
items has been repeatedly emphasized in the literature (cf. 6, 8, 19, 35, 60,
61). Various terms have been suggested to differentiate the two types of
interpretation. Among them are "factual versus psychological," "veridical
versus diagnostic," and "literal versus symptomatic." The already familiar
distinction between content validity and the various types of empirical
validity may serve in this connection. In effect, the factual-veridical-literal
interpretation is based upon the inspectionally determined content validity
of the questions, while the psychological-diagnostic-symptomatic
interpretation stems from the empirically established relationships of the
inventory responses to various appropriate criteria. As Meehl put it, the
personality inventory response "constitutes an intrinsically interesting and
significant bit of verbal behavior, the non-test correlates of which must be
discovered by empirical means" (60, p. 297).

Elsewhere, in characterizing the approach followed in the construction of 
the MMPI, Meehl wrote, 


... the verbal type of personality inventory is not most fruitfully seen as a
“self-rating” or self-description whose value requires the assumption of
accuracy on the part of the testee in his observations of self. Rather is the
response to a test item taken as an intrinsically interesting segment of verbal
behavior, knowledge regarding which may be of more value than any knowledge of
the "factual" material about which the item superficially purports to inquire.
Thus if a hypochondriac says that he has "many headaches" the fact of interest
is that he says this (61, p. 9).


In the same vein, Cattell cites the following illustration: 


The questionnaire asks, "Would you enjoy being a sailor in a submarine?" If the
subject replies "Yes," one does not assume that he would in fact be happy as a
sailor in a submarine. One observes, perhaps, that good librarians as opposed to
bad librarians more frequently answer "No" to this question, and one uses it
empirically as an index of librarianship interests or temperament (19, p. 344).


A self-report inventory is indubitably a series of standardized verbal
stimuli. When proper test-construction procedures have been followed, the
responses elicited by these stimuli are scored in terms of their empirically
established behavior correlates. They are thus treated like any other
psychological test responses. That questionnaire responses may correspond to
the subject’s perception of reality (66) does not alter this situation. It
merely provides one hypothesis to account for the empirically established
validity of certain items.

In line with the empirical validation of personality test responses as such
is the research on response styles (12, 40, 54). Originally investigated in
connection with the development of verification and correction keys, response
styles are coming to be regarded as possible diagnostic indicators in
their own right, and their validity is being explored from this point of view.
Among such response styles are defensiveness, susceptibility to façade
effects, and tendency to choose items in terms of their social desirability.
Another example is acquiescence, or the inclination to answer affirmatively
regardless of the content of the statement. Still another is the tendency to
choose atypical responses.
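As a rough illustration of how a response style such as acquiescence might be quantified, the sketch below computes the proportion of affirmative answers among the items actually answered. The function name and the "Yes" / "No" / "?" coding are assumptions made for the example; this is not the scoring rule of any published verification or correction key.

```python
def acquiescence_index(responses):
    """Proportion of "Yes" among definite ("Yes" or "No") answers.

    Hypothetical index for illustration only. "?" responses are ignored.
    """
    definite = [r for r in responses if r in ("Yes", "No")]
    if not definite:
        raise ValueError("no definite responses to score")
    return sum(r == "Yes" for r in definite) / len(definite)

# One made-up answer sheet: three "Yes", one "No", one "?".
sheet = ["Yes", "Yes", "No", "?", "Yes"]
print(acquiescence_index(sheet))  # 0.75
```

Such a proportion reflects a tendency to agree "regardless of content" only if the items are balanced, so that agreeing does not consistently point toward the same trait; with unbalanced items it is confounded with whatever the items measure.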


Personality inventories have been vigorously attacked on the grounds that 
their responses are necessarily ambiguous. A classic exposition of this diffi- 
culty was given by Allport, who wrote: 

The stimulus-situation is assumed to be identical for each subject, and his re- 
sponse is assumed to have constant significance. A test will assume, for example— 
and with some justification in terms of statistical probability—that a person who
conspicuously takes a front seat at church or at an entertainment should as a rule 
receive a plus score for ascendance. But the fact of the matter is that this person 
may seek a front seat not because he is ascendant but because he is hard of hear- 
ing. Or a test will assume again, with statistical (empirical) justification, that a 
person who confesses to keeping a diary is introverted; yet upon closer inspection 
(which no test can give) it may turn out that the diary is almost wholly an expense 
account, kept not because of introversion but because of money-mindedness. It is 
a fallacy to assume that all people have the same psychological reasons for their 
similar responses. At the level of personality it cannot be said with certainty that 
the same symptoms in two people indicate the same trait, nor that different re- 
sponses necessarily indicate different traits. All mental tests fail to allow sufficiently 
for an individual interpretation of cause and effect sequences (3, p. 449). 


An empirical demonstration of such response ambiguity was provided in a 
study by Eisenberg (34). Personality inventory items of the “Yes-?-No” 
type were administered to 219 college students. Following completion of 
the inventory, the students were asked to write a sentence or two indicating 
why they had answered each item as they did. A survey of these explana- 
tory statements revealed a wide range of interpretations for each response. 
Even more disturbing was the discovery that the identical explanation was
sometimes given for opposite answers. For example, to the question "Do you 
like to be alone?" 55 subjects appended explanations indicating that they 
liked to be alone when they had work to do, “but not socially.” Of these, 
18 had marked “Yes” in answering the question, 17 had marked “No,” 
and 20 “?.” 
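The split reported for this one explanation can be put in percentage terms. The few lines of Python below simply restate the counts given above; nothing beyond those three figures is assumed.

```python
# Answers marked by the 55 subjects who gave the same explanation
# ("when I have work to do, but not socially") for "Do you like to be alone?"
counts = {"Yes": 18, "No": 17, "?": 20}

total = sum(counts.values())
shares = {answer: round(100 * n / total, 1) for answer, n in counts.items()}

print(total)   # 55
print(shares)  # {'Yes': 32.7, 'No': 30.9, '?': 36.4}
```

The same stated reason was thus attached to each of the three possible answers in roughly equal proportions, so for these subjects the marked answer carried almost no information about the reason behind it.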

It is possible, of course, to reduce the frequency of ambiguous or equivocal
responses by formulating items more specifically and clearly. One remedy
for the situation obviously lies in this direction. The argument has also been
advanced, however, that item vagueness should be retained, since it allows
more free play to individual interpretation, which might in turn reflect
characteristic attitudes, motives, and emotional states (cf., e.g., 35).

Whether vaguely worded questions yield more or less valid responses on
personality inventories can be determined only by empirical correlation with
appropriate criteria and by other experimental procedures. It is possible
that for the measurement of certain personality characteristics, questions
permitting a certain degree of subjectivity of interpretation are more
effective. But it would be hazardous to generalize regarding all types of
ambiguous items and all personality characteristics.

Apart from practical problems of test construction, the prevalence of
response ambiguity is itself of theoretical interest. Why is ambiguity a more
serious problem in personality testing than in aptitude testing? The answer
can be found in the greater standardization of the individual's reactional
biography in the intellectual sphere (cf. 6). For example, the system of
formal education in our culture assures relative uniformity of interpretation
for, let us say, vocabulary or arithmetic computation items. But no such fund
of common antecedent experience is available for the preparation of
personality test items. This difficulty is similar to that encountered in the
construction of aptitude tests for infants and young children, who have not
yet been exposed to a highly standardized educational curriculum (cf. Ch. 11).

It may be added that the very organization of behavior into traits is
affected by the degree of uniformity of the pertinent experiential background
(cf. 6). Factor pattern analyses of the emotional and motivational aspects of
behavior have generally yielded trait categories that are less consistent
and more difficult to interpret than those of aptitudes. This fact has been
amply illustrated in this chapter.

Some psychologists have maintained that, in the domain of personality,
the individual can be effectively described only in terms of his own peculiar
behavior interrelationships, rather than in terms of common traits (cf., e.g.,
[...]). This approach represents an extreme reaction to the relatively
unstandardized nature of the emotional and motivational aspects of the
individual’s reactional biography. It is undoubtedly true that an intensive
study of the individual case will yield the richest and most precise picture
of the person. But the judicious use of common techniques and normative data
should materially aid such an analysis.


CHAPTER 19


Measures of Interests and Attitudes


The strength and direction of the individual's interests, attitudes, motives,
values, and related variables represent an important aspect of his personality.
These characteristics materially affect his educational and vocational
adjustment, his interpersonal relations, the enjoyment he derives from his
avocational pursuits, and other major phases of his daily living. Although
certain tests are specifically directed toward the measurement of one or
another of these variables, the available instruments cannot be rigidly
classified according to such discrete categories as interests, attitudes,
values, and the like. Overlapping is the rule. Thus a questionnaire designed to
assess the relative strength of different values, such as the practical,
aesthetic, or intellectual, may have much in common with interest inventories.
Similarly, such a questionnaire might be said to gauge the individual's
attitudes toward pure science, art for art's sake, practical applications, and
the like.

The study of interests has probably received its strongest impetus from
vocational and educational counseling. To a slightly lesser extent, the
development of tests in this area has also been stimulated by vocational
selection and classification. From the viewpoint of both the worker and the
employer, a consideration of the individual's interests is of practical
significance. Achievement is a resultant of aptitude and interest. Although
these two variables are positively correlated, a high level in one does not
necessarily imply a superior status in the other. An individual may have
sufficient aptitude for success in a certain type of activity—educational,
vocational, or recreational—without the corresponding interest. Or he may be
interested in work for which he lacks the prerequisite aptitudes. A measure of
both types of variables thus permits a more effective prediction of
performance than would be possible from either alone.

The assessment of opinions and attitudes originated largely as a problem
in social psychology. Attitudes toward different groups have obvious
implications for intergroup relations. Similarly, the gauging and prediction of
public opinion regarding a wide variety of issues, institutions, or practices
are of deep concern to the social psychologist, as well as to the practical
worker in business, politics, and other applied fields. In recent years, the
measurement of opinions and attitudes has also made rapid strides in the areas
of market research and employee relations.

In this chapter, we shall examine typical standardized tests designed to
measure interests, attitudes, and related aspects of personality. As in the
preceding chapter, attention will be focused upon the paper-and-pencil, verbal,
group inventory. The majority of interest and attitude measures in current
use are of this type. It should be noted, however, that in this area—as in the
measurement of all personality characteristics—other approaches are being
increasingly explored. A consideration of non-inventory techniques will be
reserved for Chapters 20 and 21.


INTEREST TESTS 


It would seem that the most expedient and direct way of determining an
individual’s interests in different types of work, educational curricula, or
recreational activities would be simply to ask him. But there is a vast array
of data, gathered chiefly in the 1920’s, which shows that answers to direct
questions about interests are often unreliable, superficial, and unrealistic
(cf. 27, Ch. 5). This is particularly true of children and young people at the
ages when information regarding interests is especially useful for counseling
purposes.

The reasons for this situation are not hard to find. In the first place, most
persons have insufficient information about different jobs, courses of study,
and other activities. They are thus unable to judge whether they would
really like all that their choice actually involves. Their interest, or lack
of interest, in a job may stem from a very limited notion of what the
day-by-day work in that field entails. A second, related factor is the
prevalence of stereotypes regarding certain vocations. The life of the average
doctor, lawyer, or engineer is quite unlike the versions popularized by
television or the less literate magazines. The problem, therefore, is that
individuals are rarely in a position to know their own interests in various
fields prior to actual participation in those fields. And by the time they
have had the benefit of such personal contact, it may be too late to profit
from the experience, since a change may be too wasteful.

For this reason, it was soon realized that more indirect and subtle
approaches to the determination of interests would have to be explored. One of
the most fruitful of these approaches originated in a graduate seminar on
interests conducted at the Carnegie Institute of Technology during the
academic year 1919-1920 (cf. 27, Ch. 3). Several standardized interest
inventories were subsequently prepared as a result of the work begun by their
authors while attending this seminar. But the one whose development has
been carried furthest is the Vocational Interest Blank (VIB), constructed by
E. K. Strong, Jr. (67, 69). Unlike other early tests, the Vocational Interest
Blank has undergone continuing research, revision, and extension.

The interest inventories developed by the Carnegie group introduced two
principal procedural innovations. First, the items dealt with the subject's
liking or dislike for a wide variety of specific activities, objects, or types
of persons that he has commonly encountered in daily living. Second, the
responses were empirically keyed for different occupations. These interest
inventories were thus among the first tests to employ criterion keying of
items. It was found that persons engaged in different occupations were
characterized by common interests that differentiated them from persons in
other occupations. These differences in interests extended not only to matters
pertaining directly to job activities, but also to school work, hobbies,
sports, types of plays or books the individual enjoyed, social relations, and
many other facets of everyday life. It thus proved feasible to question the
individual about his interests in relatively familiar things, and thereby
determine how closely his interests resembled those of persons successfully
engaged in different vocations.

The current form of the Strong VIB consists of 400 items grouped into
eight parts. In the first five parts, the subject records his preference by
encircling one of the letters L I D, signifying “Like,” “Indifferent,” and
“Dislike,” respectively. Each of these five parts is concerned with one of the
following five categories: occupations, school subjects, amusements,
miscellaneous activities (such as making a speech, repairing a clock, or
expressing judgments publicly regardless of criticism), and peculiarities of
people. The remaining three parts of the VIB require the subject to rank given
activities in order of preference, compare his interest in pairs of items, and
rate his present abilities and other characteristics.

The blank is scored with a different key for each occupation. To date, 47
occupational keys are available for scoring the men’s form, and 28 for the
women’s form. New keys are developed from time to time, as data on other
occupational groups are gathered. In the development of these occupational
scoring keys, the responses of persons successfully engaged in each occupation
were compared with those of “men-in-general” (or “women-in-general”). The
general reference group consisted of a representative sample of
professional and business men.¹ The choice of reference group was based
upon the fact that most of the occupational keys dealt with professions and
higher business positions. The use of a reference group representative of the
total male population proved less effective in bringing out the differentiating
interests of the individual occupations. The interests of professional and
business men as a group differ so much from those of skilled laborers that
the differences between one high-level occupation and another were obscured
when the more general reference group was employed. For the same reason, the
few available keys for lower-level occupations are not as discriminative as
they might be if the reference group had been closer to the socioeconomic
level of the occupations concerned (68, Ch. 21 and 22).

It should be noted, however, that other efforts to develop vocational
interest scales for lower-level occupations have failed below the
skilled-trades level (17). Workers in semiskilled and unskilled jobs seem to be
as interchangeable with regard to interests as they are with regard to
abilities. There is also extensive evidence to show that at higher levels of
the occupational hierarchy (professional, managerial, etc.) job satisfaction is
derived chiefly from intrinsic liking for the work, while at the lower levels
there is increasing reliance on such extrinsic factors as pay, security, social
contacts, and recognition as a person (17). The measurement of vocational
interest patterns thus becomes less relevant as we go down the occupational
hierarchy.
In the development of the Strong VIB keys, the relative frequencies of
each response among, let us say, engineers and men-in-general determine
the weight given to that response in the engineer key. These weights can
[...]. A positive weight indicates that the response occurs
more frequently among engineers than among men-in-general, the greater the
difference in frequency the higher the weight. A zero weight means that the
response fails to differentiate engineers from men-in-general. And a negative
weight is assigned to responses obtained less frequently from engineers than
from men-in-general. The computation of scoring weights for the first 10
items on the engineer key is illustrated in Table 24. It will be noted that an
L response to the item “author of a technical book" receives a weight of +3,
while a D response to the same item has a weight of −2. An L response to
“auctioneer,” on the other hand, is scored −1; and a D response, +2.

Parenthetically, it might be objected that in Part 1 of the Strong test, from
which the items in Table 24 were taken, the subject is asked to express his
interest in many occupations about which he may have little knowledge. But
these responses, like all others in the test, are evaluated in terms of their
empirically established correlates, rather than as a direct index of interest
in the occupation specified. This interpretation will be recognized as the
“symptomatic” treatment of responses which was discussed in the preceding
chapter. What may often be obtained in these occupational preference items
is the individual’s response to the social stereotype evoked by each
occupational name. Whether or not these responses have differential
significance for different criterion groups is then empirically determined.

¹ The women-in-general group included samples of professional and business
women, as well as a relatively large representation of housewives (68, p. 714).


TABLE 24. Determination of Scoring Weights for Strong Vocational Interest Blank:
Sample Items from Engineer Key
(From Strong, 68, p. 75)

                           Percentage of Men Giving       Differences in       Scoring
                           Each Response among:           Percentage between   Weights for
                           "Men in                        Engineers and        Engineer
                           General"       Engineers       Men in General       Scale
First Ten Items            L   I   D      L   I   D       L    I    D          L   I   D

Actor (not movie)          21  32  47     9   31  60      -12  -1   +13        -1  0   ?
Advertiser                 33  38  29     14  37  49      -19  -1   +20        -2  0   ?
Architect                  37  40  23     58  32  10      +21  -8   -13        +2  ?   ?
Army officer               22  29  49     31  33  36      +9   +4   -13        ?   ?   ?
Artist                     24  40  36     28  39  33      +4   -1   -3         0   ?   ?
Astronomer                 26  44  30     38  44  18      +12  0    -12        ?   ?   ?
Athletic director          26  41  33     15  51  34      -11  +10  +1         -1  +1  ?
Auctioneer                 8   27  65     ?   ?   ?       ?    ?    ?          -1  ?   +2
Author of novel            32  38  30     22  44  34      -10  +6   +4         -1  ?   ?
Author of technical book   ?   ?   ?      ?   32  ?       +28  ?    ?          +3  ?   -2

(? = entry illegible in this copy.)


The total score on each occupational key is simply the algebraic sum of
all item weights. Percentile and standard score norms are reported for each
occupational group. To simplify interpretation, Strong also gives letter
ratings corresponding to certain portions of the standard score distribution.
Thus an "A" rating represents a score at or above -1/2 SD, i.e., a point above
which approximately 69 per cent of the occupational group in question
scored. Similarly, the lower boundary for a "B" rating is set at -2 SD, and
scores below -2 SD are rated "C." A rating of "C" would thus mean that the
individual obtained a lower score on that particular key than about 98 per
cent of the corresponding occupational sample. It must be borne in mind
that a high score on any VIB key simply indicates close resemblance to the
interests of persons engaged in that occupation. The test does not, of course,
attempt to measure aptitudes for any vocation.
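The scoring and rating rules just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Strong's actual procedure: apart from the L and D weights for "Auctioneer," every weight in the key below is hypothetical, as are the norm-group mean and SD used in the usage lines. The cutoffs follow the simplified A, B, C scheme given above, and the normal-curve percentages are checked with the standard library.

```python
from statistics import NormalDist

# Hypothetical occupational key: weights for Like (L), Indifferent (I), Dislike (D).
# Only the Auctioneer L and D weights come from the text; all others are invented.
ENGINEER_KEY = {
    "Auctioneer": {"L": -1, "I": 0, "D": 2},
    "Item X": {"L": 2, "I": 0, "D": -2},
    "Item Y": {"L": -1, "I": 1, "D": 0},
}


def total_score(responses, key):
    """Algebraic sum of the weights attached to the responses actually given."""
    return sum(key[item][response] for item, response in responses.items())


def letter_rating(raw_score, group_mean, group_sd):
    """Convert a raw score to A/B/C using the occupational group's mean and SD."""
    z = (raw_score - group_mean) / group_sd
    if z >= -0.5:
        return "A"
    if z >= -2.0:
        return "B"
    return "C"


def share_above(z):
    """Proportion of a normal distribution scoring above a given z value."""
    return 1 - NormalDist().cdf(z)


score = total_score({"Auctioneer": "D", "Item X": "L", "Item Y": "I"}, ENGINEER_KEY)
print(score)                         # 2 + 2 + 1 = 5
print(letter_rating(45, 50, 10))     # exactly -1/2 SD, hence "A"
print(round(share_above(-0.5), 2))   # about 0.69 of the occupational group
print(round(share_above(-2.0), 2))   # about 0.98 of the occupational group
```

The last two lines reproduce the 69 and 98 per cent figures quoted in the text, on the assumption that scores within the occupational group are normally distributed.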

An individual's Vocational Interest Blank could be scored for a single
occupation. For selection purposes, for example, we might want to know how
closely the applicant's interests resemble those of successful real estate
salesmen. More often, however, the blank is scored with most, if not all,
available keys. With the resulting set of scores, it is possible to examine the
patterning of the individual's interests. This type of analysis provides a more
dependable prediction of the individual's reaction to any given type of
work. For counseling purposes, a consideration of the entire interest
profile is of prime importance.

Group keys are also available for certain interest clusters. These clusters
were identified from the correlations of scores on the various occupational
keys and were later corroborated by factor analysis (cf. 68, Ch. 8, 9, 14).
For example, one cluster, apparently characterized by a common interest in
"uplift" or social betterment, includes YMCA physical director, personnel
manager, public administrator, YMCA secretary, social science high school
teacher, city school superintendent, and minister. Another group comprises
physicist, chemist, mathematician, and engineer. And still another consists of
sales manager, real estate salesman, and life insurance salesman. Similar
groupings are provided for the women's form. But owing to the smaller
number of available keys, the classification of women's occupations is
much more tentative than that of men's occupations.

A list of the men's occupations falling within each of the groups identified
to date can be found on the Report Blank, reproduced in Figure 111. This
blank shows the standard scores corresponding to each letter rating. In
addition, the stippled area across from each occupation indicates the range of
scores that are likely to result from chance on that occupational key. This
range was determined by dice throwing and centers around raw scores of
zero. The stippled area covers ±1 SD, or the middle 68 per cent of scores
in the chance distribution. Scores falling within the stippled area signify
neither clear-cut agreement nor disagreement with the interests of persons in
the given occupations. For illustrative purposes, the profile of a group of
medical students has been plotted in Figure 111. This profile shows the
mean standard score of 47 medical students on each of the occupational
keys available at the time of testing.
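The idea behind the stippled chance band can be illustrated with a small simulation. The sketch below is not a reconstruction of Strong's dice-throwing procedure: it substitutes a pseudorandom generator for the dice, assumes that L, I, and D are equally likely on every item, and uses an entirely hypothetical 100-item key. Because the weights are invented, the resulting band need not center on zero as the published chance ranges do.

```python
import random
from statistics import mean, pstdev


def chance_band(key, n_blanks=5000, seed=0):
    """Score many randomly marked blanks and return the mean +/- 1 SD band."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_blanks):
        # Each item is answered L, I, or D purely at random.
        totals.append(sum(weights[rng.choice("LID")] for weights in key))
    center, spread = mean(totals), pstdev(totals)
    return center - spread, center + spread, totals


# Hypothetical key: 100 items with integer weights between -4 and +4.
key_rng = random.Random(1)
HYPOTHETICAL_KEY = [{r: key_rng.randint(-4, 4) for r in "LID"} for _ in range(100)]

low, high, totals = chance_band(HYPOTHETICAL_KEY)
inside = sum(low <= t <= high for t in totals) / len(totals)
print(f"chance band: {low:.1f} to {high:.1f}")
print(f"share of random blanks inside the band: {inside:.2f}")
```

Since a total score is the sum of many independent item weights, the simulated chance scores are approximately normal, so roughly two-thirds of the random blanks fall inside the band. A real score inside that range is therefore hard to distinguish from random marking.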

In general, an examination of the individual's score pattern on separate
occupational keys is to be preferred to the use of group scales, since the
latter are not equally representative of all their constituent occupations (17).
With available electronic scoring equipment, it is feasible to obtain
scores on all keys within a very short time. For certain occupations, the
nature of the criterion group employed in developing the key should also be
carefully considered. Two studies, for example, have revealed significant
differences in VIB performance among several different types of salesmen,
in addition to the three broad categories for which separate keys are
provided (38, 86). Such findings are in line with the trend away from the
concept of a general sales personality.

Fig. 111. Profile of Medical Students on the Strong Vocational Interest Blank,
Showing Mean Standard Scores of 47 Students. Group scales are available for
Clusters V, VIII, IX, and X. (Reproduced by permission of Stanford University
Press; data from Strong, 68, p. 421.)

In addition to the occupational keys, the VIB can also be scored with
special keys for interest maturity, masculinity-femininity, occupational level,
and specialization level. The first of these scales differentiates between the
interests of 15- and 25-year-old men. Beyond 25, little change in interest
scores has been found. Changes are most rapid between 15 and 20. For
this reason, scores on the occupational scales are likely to be too low in
the case of men under 20 (cf. 68, Ch. 12).

The masculinity-femininity scale shows the degree of similarity of the
individual's interests to the interests characteristic of men or of women,
respectively. The occupational-level scale measures the difference between the
interests of laboring men, on the one hand, and those of business and
professional men, on the other. Mean occupational-level scores range from 64
for lawyers to -44 for unskilled laborers. This scale has been interpreted by
some investigators as an index of level of aspiration, motivation, or status
drive (17, pp. 115-118).

The most recent addition to the group of non-occupational scales consists
of a scale for measuring specialization level (71). Originally designed for
differentiating between the interests of specialists and general practitioners in
medicine, this scale proved to have discriminative value in other fields
requiring advanced specialized study. It was therefore released for general use
among college men, with the tentative explanation that it indicates whether
or not the individual would enjoy advanced study of a type involving narrow
specialization. Other scales of more restricted application have been
constructed for special purposes. Examples include scales for differentiating
specialties within the fields of medicine, psychology, and engineering (70,
pp. 159-161). In the construction of such specialty scales, the reference
group consists of members of the occupation as a whole, rather than the
men-in-general sample.

Odd-even reliabilities of the VIB scales average .88. With one exception,
all fall above .80. Long-range retests of men originally tested in college have
shown good stability. Over intervals of approximately 18 years, the median
correlation was .69 (70, p. 63). The scales correlated were those
corresponding to the individual's actual occupation at the time of the
follow-up. [...] also exhibited [...] long-range stability. The stability of
interest profiles was also checked. For this purpose, each individual's scores
on 34 scales, obtained on the initial test, were correlated with his retest
scores on the same 34 scales. Over a 22-year period, the median correlation
for all individuals was .74, again indicating remarkable stability (70,
pp. 64-65). In line with the increasing interest in response style, some
attention has been given to the relative frequency of likes (L) and dislikes
(D) marked by an individual on the VIB. Conspicuous differences in this
regard have been found among occupational groups, as reflected in the
corresponding keys. That such a response tendency is a fairly stable
individual characteristic was demonstrated in an analysis of the Physician
key responses of 71 physicians tested in 1927 (as college students) and again
in 1949 (80). Over this 22-year period, the L scores correlated .54 and the
D scores .62, both coefficients being significant well beyond the .01 level.

From the viewpoint of validity, follow-up studies have also indicated
considerable correspondence between initial interest scores and eventual
choice of occupation. In terms of expectancy ratios (cf. Ch. 7), the chances
are 78:22 that a man with an A rating in an occupation will enter that
occupation, and the chances are 83:17 that a man with a C rating will not enter
the occupation (70, p. 42). These ratios were computed by checking VIB
scores of 663 students with the occupations in which they were engaged 18
years later (70, p. 43). Indices of job satisfaction yielded low but significant
correlations with interest scores in the individual's own occupational field.
These correlations were about as high with initial interest scores (18 years
earlier) as with present retest scores (70, p. 114). All in all, the VIB
represents one of the most successful approaches to the measurement of
non-intellectual variables.

Another widely used interest test is the Kuder Preference Record—Vocational
(40). Developed more recently than the VIB, this test followed a
different approach in the selection and scoring of items. Its major purpose
was to indicate relative interest in a small number of broad areas, rather than
in specific occupations. The items were originally formulated and tentatively
grouped on the basis of content validity. This was followed by extensive item
analyses on high school and adult groups. The object of such item analyses
was the development of item groups showing high internal consistency and
low correlations with other groups. This aim was reasonably well fulfilled for
most of the scales.

The items in the Kuder—Vocational are of the forced-choice triad type.
For each of three activities listed, the respondent indicates which he would
like most and which he would like least. Two sample triads are illustrated
in Figure 112. The test provides 10 interest scales plus a verification scale for
detecting carelessness and failure to follow directions. The interest scales
include: Outdoor (agricultural, naturalistic), Mechanical, Computational,
Scientific, Persuasive, Artistic, Literary, Musical, Social Service, and Clerical.
Separate sex norms are available for high school, college, and adult groups.
Total scores in the 10 interest areas are plotted on a normal percentile chart,
as illustrated in Figure 113.

    Visit an art gallery
    Browse in a library
    Visit a museum

    Collect autographs
    Collect coins
    Collect butterflies

In the first triad, the subject has punched a hole to the left of R, indicating that
he likes most to visit a museum. He has punched a hole to the right of Q,
indicating that he likes least to browse in a library.
The second triad has been similarly marked.

Fig. 112. Sample Items from Kuder Preference Record—Vocational. (Reproduced
by permission of Science Research Associates.)

The reliabilities of the Kuder scales, as determined by the Kuder-Richardson
technique, cluster around .90. Stability over intervals of about a year or
less also appears to be satisfactory. Little information is available regarding
stability over longer periods. There is some evidence to suggest that,
especially in the case of high school students, shifts in high and low interest
areas are relatively frequent when retests are several years apart (34, 49,
55, 60). Studies on the simulation of interest scores have shown that faking
is possible to some extent on both the Kuder and the Strong; but visibility
is somewhat greater on the Kuder, owing to the more obvious nature of its
items (20, 47).
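For readers unfamiliar with the Kuder-Richardson technique, a minimal sketch of one common variant, Formula 20, is given below. The text does not say which Kuder-Richardson formula was applied to the Kuder scales, and the data are invented; the sketch assumes each item is scored 0 or 1 in the keyed direction.

```python
from statistics import pvariance


def kr20(item_scores):
    """Kuder-Richardson Formula 20 for a persons-by-items table of 0/1 scores."""
    n_persons = len(item_scores)
    n_items = len(item_scores[0])
    # Variance of the persons' total scores on the scale.
    total_variance = pvariance([sum(person) for person in item_scores])
    # Sum over items of p * q, where p is the proportion scoring 1 on the item.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(person[j] for person in item_scores) / n_persons
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


# Invented data: five persons, four items.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(data), 2))  # 0.8
```

In the invented data the total-score variance is 2.0 and the item variances sum to 0.8, so the coefficient is (4/3)(1 - 0.8/2.0) = .80. The coefficient rises as the items of a scale correlate more highly with one another, which is why it serves as an index of internal consistency.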

The manual for the Kuder Preference Record provides an extensive list of
occupations, grouped according to their major interest area or pair of interest
areas. For example, radio operator is classified under Mechanical; landscape
architect, under Outdoor-Artistic. This is an a priori listing in terms of logical
or content analysis. In addition, a growing list of empirically established
occupational profiles has been included in successive revisions of the manual.
The data for these average occupational profiles have been contributed
largely by test users. Consequently, many of the groups are small and their
representativeness or comparability is often questionable. Some attempts have
been made to work out a coding system for Kuder profiles (10, 26, 85),
similar to that developed for the MMPI. This suggested coding of occupational
profiles is based upon empirically established mean scores for each group,
rather than upon a priori classifications. Efforts have also been made to
develop equations for finding total scores for specific occupations or other
criterion groups.

More recently, the Kuder Preference Record—Occupational (42) has
been developed through criterion keying procedures similar to those
followed in the Strong VIB. Incorporating many items drawn from the earlier,
Vocational form, the Kuder—Occupational currently provides 38
occupational scores plus a verification score. Some of the scales are designed for
very specifically defined occupational groups, as illustrated by county
agricultural agent, counseling psychologist, department store salesman, and radio
station manager. Kuder-Richardson reliabilities of individual scales are
lower than in the Kuder—Vocational, partly because empirical validity
against occupational criteria was given priority over internal consistency in
item selection. Median retest reliabilities of .79 and .86 are reported for
high school and college samples, respectively. No evidence for long-range
stability is given, however. Although the occupational keys were
cross-validated on new samples, the available data pertain to concurrent validity
only, and not to predictive validity.

It will be noted that recent developments in connection with both the
Kuder and Strong inventories have been such as to reduce the initial
differences in the approach of these two instruments to the measurement
of interests. On the one hand, occupational-interest clusters, group
scales, and profile analysis have broadened the interpretation of VIB scores.
On the other hand, occupational scores have been derived for the
Kuder—Vocational; and the newly developed Kuder—Occupational has substituted
validation against occupational criteria for internal consistency of items
within broadly defined interest areas.

Still another member of the Kuder family of interest inventories is the
Kuder Preference Record—Personal (41). Although developed earlier than
the Kuder—Occupational, this inventory has not been widely used and should
be considered as an experimental instrument. Again using the triad item form,
it provides a verification score and five scores designed to show relative
preference for being in groups (Sociable), for familiar and stable situations
(Practical), for dealing with ideas (Theoretical), for avoiding conflicts
(Agreeable), and for directing others (Dominant). It can be seen that this
test cuts across traditional interest areas and some of the personality factors
described in the preceding chapter. Correlations with other interest and
personality tests, however, raise doubts about the identification of the
characteristics measured by the five scales. Certainly the scale labels should not
be interpreted literally without considerably more construct validation.

Although the Strong VIB and the Kuder—Vocational are the most widely
used instruments for the measurement of interests, many other inventories
have been developed. Some are specifically directed toward the appraisal of
educational or recreational interests (cf., e.g., 32; 72; 73, Ch. 18). Others
have a predominantly vocational slant, like the Strong and
Kuder—Occupational. Still others deal with relatively broad interest areas, like
the Kuder—Vocational. Well-known examples include the Thurstone Interest
Schedule (76), the Guilford-Shneidman-Zimmerman Interest Survey (32), and
the Occupational Interest Inventory prepared by Lee and Thorpe (45). In
all of these inventories, items were selected and grouped on the basis of their
apparent content validity, although internal consistency procedures were
followed in the further refinement of scales. All should be regarded as
preliminary or experimental instruments.

Corroboration for several of the interest areas covered by the
Guilford-Shneidman-Zimmerman Interest Survey was later provided by an extensive
factor analysis of interests conducted by Guilford and his associates (30).
Based upon the intercorrelations of 95 ten-item interest tests, this study
sampled an unusually wide range of interests. The number of subjects was large,
including 600 airmen and 720 officer candidates in the Air Force, for whom
separate correlation matrices were computed and analyzed. Of the 24 factors
identified for airmen and the 23 for officers, 17 were common to both groups.
These common factors were described as follows:


Mechanical Interest
Scientific Interest
Adventure vs. Security
Social Welfare
Aesthetic Appreciation
Cultural Conformity
Self-Reliance vs. Dependence
Aesthetic Expression
Clerical Interest
Need for Diversion
Autistic Thinking
Need for Attention
Resistance to Restriction
Business Interest
Outdoor Work Interest
Physical Drive
Aggression


Several of these factors suggest the role of culture in structuring interest
patterns. A number follow traditional occupational categories characteristic
of our society, as illustrated by Mechanical, Scientific, Social Welfare,
Clerical, and Business interests. To a lesser extent, this is also true of
Aesthetic Appreciation, Aesthetic Expression, and Outdoor Work interests.
Certain factors are strongly reminiscent of traits measured by such
personality inventories as the Edwards and the MMPI. More will be said about the
overlap of interest and "personality" traits in the last section of this chapter.


OPINION AND ATTITUDE SURVEYS 


An attitude is often defined as a tendency to react favorably or
unfavorably toward a designated class of stimuli, such as a national or racial
group, a custom, or an institution. It is evident that, when so defined,
attitudes cannot be directly observed, but must be inferred from overt behavior,
both verbal and non-verbal (29, Ch. 9; 48). In more objective terms, the concept
of attitude may be said to connote response consistency with regard to certain
categories of stimuli (11). In actual practice, the term "attitude" has been
most frequently associated with social stimuli and with emotionally toned
responses.

Opinion is sometimes differentiated from attitude, but the proposed
distinctions are neither consistent nor logically defensible. More often the two
terms are used interchangeably, and they will be so employed in this
discussion. Common usage, however, has popularized the expressions "opinion
polling" and "attitude scales" to indicate two distinct methodologies
employed in attitude-opinion surveys (48). Although arbitrary, the association
of opinions with polls and of attitudes with scales has become conventional.


Opinion polling represents a single-question approach. The answers are
usually in the form of "yes" or "no," although an "undecided" category is
often included. Sometimes a larger number of response alternatives is
provided. In other cases, the subject may be asked to rank items in order of
preference. Under special circumstances, the questions may be of the
"open-end" type, in which the subject is free to formulate the answer in
his own words. Prior to tabulation, the answers to such open-end questions
must be coded, or classified on the basis of their essential content. Regardless
of the form in which the questions and answers are expressed, the final
results are reported in terms of the percentages of persons giving each type
of answer.

Attitude scales, on the other hand, yield a score based on the individual's
responses to a series of questions pertaining to the issue under investigation.
To be sure, opinion polls may also contain more than one question. But the
replies to these questions are kept separate rather than being combined. In
the construction of an attitude scale, moreover, the different questions are
designed to measure a single attitude, or unidimensional variable, and some
objective procedures are usually followed in the effort to approximate this
goal. The questions used in opinion polls, on the other hand, may be quite
unrelated. Attitude scales are also characteristically concerned with intensity
of response.
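The contrast between the two methodologies can be made concrete with a short sketch. The answers and ratings below are invented, and the scale is assumed, purely for illustration, to be scored by summing five-point agreement ratings; the text does not commit to any particular scaling technique.

```python
from collections import Counter


def poll_percentages(answers):
    """Opinion poll: percentage of respondents giving each answer to ONE question."""
    counts = Counter(answers)
    return {answer: 100 * n / len(answers) for answer, n in counts.items()}


def scale_score(item_ratings):
    """Attitude scale: one person's ratings on several related items, combined."""
    return sum(item_ratings)


# Invented poll data: five respondents answering a single question.
poll = ["yes", "no", "yes", "undecided", "yes"]
print(poll_percentages(poll))   # {'yes': 60.0, 'no': 20.0, 'undecided': 20.0}

# Invented scale data: one respondent rating four statements from 1 to 5.
print(scale_score([4, 5, 3, 4]))  # 16
```

The poll result describes a group and says nothing about any one respondent, whereas the scale score places a single individual along one attitude dimension and, through the graded ratings, reflects intensity as well as direction.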

Both polls and attitude scales have been widely used for a variety of
purposes. Public opinion research is a well-known example (12, 48, 50, 56).
Besides the familiar and controversial election forecasts undertaken by
commercial pollsters, nationwide polls are carried out regularly on many social,
political, economic, and international questions of popular interest. The
object of such surveys is similar to that of any other procedures for gauging
public opinion, and the resulting information can be put to a number of
practical uses. The large-scale nature of these surveys and the rapidity with
which answers are required have encouraged the utilization of polling,
rather than attitude scale, techniques. But attitude scales have also found
a place in public opinion research. An example is provided by "public
morale surveys," such as that conducted by Rundquist and Sletto (61)
during the depression years of the 1930's. Attitude scales have likewise been
employed to explore the attitudes of more narrowly defined publics, such as
farmers or persons living in certain areas.

Market research (5) has much in common with public opinion studies.
Because of similar practical demands, polling techniques have served as the
principal tool in both fields. The consumer public, or potential consumer
public, to which market research is directed can usually be more specifically
defined than the population to be sampled in public opinion polls. Such a
consumer public would of course vary somewhat with the nature of the
particular commodity under consideration. The object of market research is to
investigate consumer needs and reactions in reference to products, services,
or advertisements. The resulting information may be used for such purposes
as choosing the most effective advertisement for a specific article, improving
a type of service, preparing a new model, or designing a new product to
meet consumer specifications.

Another major field of application is to be found in the measurement of
employee attitudes and morale (6, Chs. 3, 4, 5; 51). Both single-question and
scale procedures have been utilized for this purpose. Some surveys are
concerned with an over-all estimate of favorable or unfavorable attitude toward
the job or the company. More often, the investigation yields a profile of
attitudes toward different aspects of the job situation. An example of an
instrument designed for the latter purpose is the SRA Employee Inventory
(8), which consists of 78 items chosen to sample attitudes in such areas as
job demands, working conditions, pay, supervisor-employee interpersonal
relations, adequacy of communication, and identification with the company.
Subsequent factor analyses of this inventory, however, have failed to
support the classification of items into the original categories (2, 3, 84). Either
the total score on job attitude or revised keys on the objectively identified
factors should be used. In the more carefully conducted employee attitude
surveys, it is customary to construct special instruments for use in a given
company. It is thus possible to tailor each question to local conditions and to
obtain reactions to more specific characteristics of the particular job
situation.

Attitude surveys are also employed to check the effectiveness of
education and training. They may, for example, provide an index for evaluating
different instructors or instructional practices. The Survey of Student
Opinions administered by the Air Force in its navigation training program
illustrates such an application (14). To be sure, if misused by supervisors as a
basis for criticism of subordinates or for other unfavorable administrative
action, these procedures may seriously undermine morale. But if used wisely
and constructively, they can provide information that will be helpful to the
individual instructor. Attitude surveys have also been utilized in measuring
the changes in student attitudes toward literature, art, different racial or
cultural groups, social and economic questions, or other pertinent matters,
following a course of study or an educational program (cf., e.g., 64).

One of the earliest and most extensive applications of attitude surveys is
to be found in research in social psychology, personality theory, and related
areas. Practically every textbook on social psychology contains sections on
attitudes and their measurement. Sociologists are likewise concerned with
attitudes, although some are skeptical regarding the feasibility of
quantification and objective measurement in this area. Among the many problems
investigated through attitude surveys may be mentioned group differences in
attitudes, the role of attitudes in intergroup relations, background factors
associated with the development of attitudes, the interrelations of attitudes
(including factor analyses), trends and temporal shifts in attitudes, the
experimental alteration of attitudes through interpolated experiences, and
parent-child relations. Two instruments that have served as a basis for an
unusually large amount of research on a variety of psychological problems
are the California F scale designed as a measure of authoritarianism (59,
79) and the Parental Attitude Research Instrument (PARI), constructed by
psychologists in the Child Development Section of the National Institute of
Mental Health (62).


The measurement of attitudes is a subject of recurrent controversy and
debate. Whether verbally expressed opinions can be regarded as indicators
of "real" attitudes has frequently been questioned. In part, this problem
concerns the relationship between verbal and non-verbal overt behavior. In
other words, does the individual suit his actions to his words, or to his
attitude scale score? Discrepancies between verbally expressed attitudes and
overt behavior have been noted in a number of studies (16, 44). In an
investigation on college students, for example, a correlation of only .02 was
found between scores on an anonymous but secretly coded scale of attitude
toward cheating and actual cheating behavior in scoring examination papers
(16). Considerable cheating was found despite the strong attitude against
cheating professed by the group as a whole. There is, of course, little room
for diversity of socially acceptable opinion about cheating, a fact that may
account in part for the lack of correspondence between verbal and
non-verbal behavior in this situation.

It has been further pointed out that even observations of overt behavior
may not always provide an accurate index of attitude. For example, an
individual may both profess strong religious beliefs and attend church
regularly, not because of his religious convictions but as a means of gaining social
acceptance in his community. Such a possibility raises the further question of
the relationship between "public" and "private" attitudes. How do the
individual's publicly expressed attitudes compare with the opinions he voices
in conversation with intimate friends or with the stranger in the club car
whom he never expects to meet again? Public opinion surveys are usually
"public" in more than one sense. They represent a verbal expression of
attitudes by the public and to the public.

To some extent, anonymous expressions of attitudes may provide a closer
approximation to private attitudes; but the two cannot be assumed to be
identical. Moreover, the individual's verbally expressed attitudes, even when
reported "privately" or anonymously, may sometimes differ from his
general, unvocalized attitudinal responses. The latter represent vague feelings or
other implicit reactions that have not been overtly verbalized by the
individual.

The relationship between "what the person says" and "what he does,"
as well as the relationship between publicly and privately expressed attitudes,
will be recognized as special instances of validity. Attitude scales and opinion
polls may be validated against a number of criteria, such as membership in
contrasted groups, ratings by close acquaintances, and biographical data
secured through intensive interviews or case studies. Because of the practical
difficulties in obtaining such criterion data, however, investigators have
frequently relied upon the familiar makeshifts involving validation by internal
consistency or by correlation with another attitude scale. All too often, resort
has been made to a superficial kind of content validity based upon the
examination and classification of questions according to the topics covered.
For opinion polling, validation is rarely attempted at all.

It is admittedly true that the validation of attitude measures presents a
difficult problem. In most practical situations, the validity concept can be
reduced to a question of how far one can generalize from test results. Many
attitude or opinion surveys are conducted for the stated purpose of
systematically exploring verbally reported attitudes. In such a case, the
criterion should itself be defined in terms of verbally expressed attitudes.
In other instances, very different purposes are to be served by the survey.
More attention, however, should be given to the explicit formulation of the
objectives of each survey, as well as to the precise definition of the
criterion. And as a prerequisite to further advances in the development of
validation procedures, the very concept of attitude needs additional
clarification.

Data on reliability are also scanty, especially with reference to polling.
Yet in the case of polling techniques, with their reliance upon single
questions, reliability is most likely to be suspect.

Attitude surveys also present a number of methodological problems. These
problems are not fundamentally different from those encountered in the
construction and administration of other types of psychological tests, but
they are accentuated in the measurement of attitudes. The major difficulties
center about the formulation of questions, the administration of the survey,
and the procurement of an adequate sampling of the population (cf. 48).
Proper formulation of questions is especially important in polling
techniques, owing to their reliance upon single questions and to the usual
lack of time for extensive pretesting and item analysis. It has been
repeatedly demonstrated that survey results may be substantially altered by
changing the form in which questions and answers are expressed (cf. 12, 48).
Many "rules" for good question writing have been listed. Among the pertinent
factors to be considered are ambiguity, leading and loaded questions,
unfamiliar terms, confusing and complex wording, the use of negatives and
double negatives, the number of alternative answers, and the form in which
responses are given. The results obtained with any one question may likewise
be affected by its context, as determined by preceding questions, opening
remarks, or stated sponsorship of the survey. References to stereotypes, such
as Fascist or Communist, may alter the subject's response, since he is likely
to respond to the emotionally toned stereotype rather than to the specific
content of the question.

The conditions under which the survey is conducted may also influence the
results. In surveys of employee attitudes, for instance, precautions are
generally taken to insure anonymity of replies by having unsigned
questionnaires dropped into a ballot box or mailed to an outside consulting
organization. For the same reason, interviews are often conducted by persons
not connected with the company and unacquainted with individual workers.
Whenever feasible, anonymity is a desirable condition in most types of
attitude surveys, because it encourages frankness and is more likely to evoke
"private" attitudes. Similarly, whether the responses are obtained
individually or in groups, in face-to-face oral interviews or in writing, in
person or by telephone, represent important procedural questions. Each of
these techniques has its characteristic advantages and disadvantages. But it
cannot be assumed that all these methods yield interchangeable results.


The role of the interviewer has itself been the subject of extensive
research. For certain types of surveys, such as those employing unguided and
"depth" interviews or open-end questions, highly trained and experienced
interviewers are essential. All interviewers, however, need special
qualifications and preparation. Checking procedures are also desirable as a
means of controlling the accuracy and honesty of individual interviewers. Of
special psychological interest are studies showing that attitude survey
results are influenced by interviewer bias as well as by certain recognizable
characteristics of the interviewers, such as their socioeconomic level, race,
and other group memberships (12, 39, 57).

The problem of sampling is of fundamental importance in all attitude surveys.
It is especially acute in public opinion polling and market surveys. In order
to obtain a satisfactory sample, it is first necessary to define and describe
the population to be surveyed. Then a sample must be chosen that will be
representative of that population. When the population is large and
heterogeneous, as in the case of voters in a national election, it becomes
extremely difficult to secure a truly representative cross section. Biased
samples, containing a disproportionate number of persons of certain types,
are likely to give incorrect estimates of population opinions.

Although a number of sampling procedures have been carefully worked out
(cf. 48), poor sampling remains one of the principal weaknesses of public
opinion polls and, to a lesser extent, of market surveys. The path of the
pollster is beset with sampling pitfalls. For example, persons in lower
socioeconomic levels tend to be less accessible and less cooperative for
survey purposes. Consequently, they are often under-represented in such
polls. Similarly, individuals who fill out and return mailed questionnaires
often differ from those who fail to reply. In many cases, the mail
respondents tend to be more favorably inclined toward the company or
organization that may be sponsoring the survey, and their responses reflect
this favorable bias.
All of the problems considered above, including validity, reliability,
questionnaire construction, administration, and sampling, are encountered in
varying degrees in both opinion polling and attitude measurement. From a
technical viewpoint, attitude scales are clearly superior to opinion polls.
In their construction and use, attitude scales are more nearly similar to
psychological tests. Although by their very nature most polls and many
attitude surveys must be custom-made to meet specific needs, at least some
attitude scales have been developed for general use. In the following
section, we shall examine in more detail the procedures employed in the
development and application of attitude scales. Typical examples of available
standardized scales will be considered.


ATTITUDE SCALES 


Attitude scales are designed to provide a quantitative measure of the
individual's relative position along a unidimensional attitude continuum.
Special procedures have been devised in an attempt to achieve comparability
of scores from scale to scale, equality of distances between scale units, and
unidimensionality or homogeneity of items. Thurstone's adaptation of
psychophysical methods to the quantification of judgment data represents an
important milestone in attitude scale construction (77). By these procedures,
Thurstone and his co-workers prepared about thirty scales for measuring
attitude toward war, Communism, Negroes, Chinese, capital punishment, the
church, patriotism, censorship, and many other institutions, practices,
issues, or groups of people.

The construction of the Thurstone-type scales may be illustrated by
considering the Scale for Measuring Attitude toward the Church, the
development of this scale having been fully reported in published form (78).
Essentially the same procedure was followed in preparing all other scales in
the series. The first step was to gather a large number of statements
regarding the church. These statements were obtained principally by asking
several groups of people to write out their opinions about the church. The
list was supplemented with statements taken from current literature. A search
was made for expressions of opinion ranging from extremely favorable, through
neutral, to extremely unfavorable. From the material thus collected, a list
of 130 carefully edited, short statements was drawn up.

These statements, each mimeographed on a separate slip, were then given to
each of 300 judges for sorting into 11 piles, from A to K. The judges were
instructed to put in category A those statements they believed expressed the
highest appreciation of the value of the church, in category F those
expressing a neutral position, and in category K those expressing the
strongest depreciation of the church. In the intervening piles, statements
were to be arranged in accordance with the degree of appreciation or
depreciation they expressed. This sorting procedure has been described as the
method of "equal-appearing intervals," although the judges were not actually
told that the intervals between piles were to appear equal. It should also be
noted that the judges were not asked to indicate their own attitudes toward
the church, but were requested only to classify the statements.

The percentage of judges who placed each statement in the different
categories constituted the basic data for computing the "scale values" of the
statements. The method is illustrated in Figure 114. On the baseline of the
cumulative frequency graph are the numbers 1 to 11, corresponding to
categories A to K, which are treated as equally spaced points on the scale.
The vertical axis shows the percentage of judges placing the statement in or
below each category. The 50th percentile, or median position, assigned by the
judges to the statement can be read directly from the graph. This median
position is taken as the scale value of the statement. It will be noted that
for one of the statements illustrated (No. 39), which is quite favorable to
the church, the scale value is 1.8. The other statement (No. 8) has a scale
value of 6.7. Once the scale values were computed for all statements, the
next step was to select those statements whose scale values were equally
spaced along the attitude continuum.

Besides the scale value, or median position, of each statement, the graphs
also show the variability or spread of positions assigned to it by the
different judges. The index of variability employed for this purpose is Q, or
half the distance between the 25th and 75th percentile points. Reference to
Figure 114 shows that the judges agreed closely in the placement of statement
39, but varied widely in classifying statement 8. This difference is
reflected in the Q's of 1.3 and 3.6, respectively, which were obtained for
the two statements. This measure of variability was taken as an index of the
ambiguity of statements. Thus ambiguous or double-barreled statements, which
are variously interpreted by different judges, would tend to be less
consistently classified. Accordingly, statements yielding high Q's were
eliminated from the final scale.
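The graphical reading of scale values and Q's described above can also be
sketched numerically. The following Python fragment is offered only as an
illustration, not as Thurstone's own computing routine. It assumes that the
cumulative proportion for each category is plotted at the points 1 to 11,
that the curve starts from zero at point 0, and that percentile points are
found by straight-line interpolation between plotted points. The function
names and the tallies of judges are hypothetical. Q follows the definition
given in the text: half the distance between the 25th and 75th percentile
points.

```python
from itertools import accumulate


def percentile_point(cum_props, p):
    """Return the baseline position at which the cumulative curve reaches p.

    cum_props[k - 1] is the proportion of judges placing the statement in or
    below category k (assumed plotted at x = k). A starting point (0, 0) is
    assumed, and positions are found by linear interpolation.
    """
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    ys = [0.0] + list(cum_props)
    for x in range(1, len(ys)):
        if ys[x] >= p:
            below, above = ys[x - 1], ys[x]
            return (x - 1) + (p - below) / (above - below)
    raise ValueError("cumulative proportions never reach p")


def scale_value_and_q(pile_counts):
    """Median position (scale value) and Q for one statement."""
    total = sum(pile_counts)
    cum_props = [c / total for c in accumulate(pile_counts)]
    median = percentile_point(cum_props, 0.50)
    q = (percentile_point(cum_props, 0.75) - percentile_point(cum_props, 0.25)) / 2
    return median, q


# Hypothetical sorting of one statement by 300 judges into piles A to K.
counts = [60, 120, 60, 30, 30, 0, 0, 0, 0, 0, 0]
value, q = scale_value_and_q(counts)
print(round(value, 2), round(q, 2))
```

A statement with a small Q in such a computation would be one on which the
judges agreed closely; a large Q would mark it as a candidate for elimination.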

The statements were also checked for irrelevance. This was accomplished by
presenting the 130 statements to subjects with the instructions to mark those
statements with which they agreed. The responses were then analyzed
statistically to determine their internal consistency. Statements that failed
to meet the criterion of internal consistency were excluded as being
irrelevant to the variable under consideration.

Statement No. 39: "I believe the church is absolutely needed to overcome the
tendency to individualism and selfishness. It practices the golden rule
fairly well." Scale value = 1.8; Q = 1.3.

Statement No. 8: "I believe the church has a good influence on the lower and
uneducated classes but has no value for the upper, educated classes." Scale
value = 6.7; Q = 3.6.

[Figure: for each statement, the cumulative percentage of judges (vertical
axis) is plotted against the scale categories (horizontal axis).]

Fig. 114. Scale Values for Thurstone-Type Attitude Scale. (Adapted from
Thurstone and Chave, 78; reproduced by permission of University of Chicago
Press.)

The final scales thus comprise items that proved to be relatively
unambiguous, relevant, and evenly distributed over the range of scale values.
The resulting Scale for Measuring Attitude toward the Church consists of 45
items. Most of the other scales in the series have been prepared in two
parallel forms, each containing 20 to 22 statements. The sequence of
statements is random with respect to scale values. The latter are not, of
course, shown on the test blank.

In taking any of the Thurstone-type attitude scales, the subject marks all 
statements with which he agrees. His score is simply the median scale value 
of the statements he has endorsed. Parallel-form reliabilities of such scores 
generally fall between .70 and .90, clustering in the low .80's. Some valida- 
tion studies have been undertaken, principally by means of contrasted 
groups or self-ratings. But the available evidence for validity is meager. 
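
The scoring rule stated above, the median scale value of the endorsed
statements, can be illustrated with a short Python sketch. The statement
labels and scale values below are hypothetical and are not taken from any
published scale.

```python
from statistics import median


def thurstone_score(scale_values, endorsed):
    """Median scale value of the statements the subject has endorsed.

    scale_values maps a statement label to its scale value; endorsed lists
    the labels of the statements marked by the subject.
    """
    if not endorsed:
        raise ValueError("no statements were endorsed")
    return median(scale_values[label] for label in endorsed)


# Hypothetical five-statement scale.
values = {"s1": 1.5, "s2": 3.0, "s3": 4.6, "s4": 6.2, "s5": 8.8}
print(thurstone_score(values, ["s2", "s3", "s4"]))
```

Because the score is a median, it is expressed on the same 11-point continuum
as the scale values themselves.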

The Thurstone technique has been applied to the development of attitude
scales for many purposes. Thurstone-type scales have been constructed, for
example, to measure the attitude of employees toward their company. A few
statements from one of these scales, with their corresponding scale values,
are shown below (82):


I think this company treats its employees better than any other
  company does                                                      10.[?]
If I had to do it over again I'd still work for this company         [?]
The workers put as much over on the company as the company puts
  over on them                                                       5.1
You've got to have "pull" with certain people around here to
  get ahead                                                          2.1
An honest man fails in this company                                  0.8


A difficulty inherent in the development of Thurstone-type scales pertains to
the possible effects of the judges' own attitudes upon their classification
of the statements. Thurstone recognized this problem, stating that "if the
scale is to be regarded as valid, the scale values of the statements should
not be affected by the opinions of the people who help to construct it," and
adding "until experimental evidence may be forthcoming, we shall make the
assumption that the scale values of the statements are independent of the
attitude distribution of the readers who sort the statements" (78, p. 92). A
number of early studies corroborated Thurstone's original assumption, in so
far as scale values did not differ appreciably when redetermined on groups
known to differ in their own attitudes (4, 23, 35, 54).


Later investigations, however, found that under certain conditions scale
values are significantly affected by judges' attitudes (22, 24, 37, 43). Thus
large and significant shifts in the scale values of statements about war
occurred from 1930 to 1940 (22). Similar differences were obtained with the
scale on attitude toward Negroes, when items were rescaled on several Negro
and white groups (37). The constancy of scale values reported by the earlier
studies appeared to have resulted in part from the failure to include
subjects with a sufficiently wide range of attitudes. Moreover, it was a
common practice in these studies, as in Thurstone's original work, to discard
the records of judges who placed 30 or more statements in a single pile. Such
a massing of statements was regarded as evidence of careless sorting. It has
subsequently been demonstrated, however, that these disproportionate
groupings are more likely to occur among individuals whose own views are
extreme (37).

In general, intergroup differences in item placement are reduced when
ambiguous items with large inter-judge variability (Q) are eliminated and
when only items separated by large scale differences are chosen (24). On the
other hand, if judges are presented with ambiguous and relatively neutral
items and if the conditions of judgment are not highly controlled, as when
judges are allowed to determine how many categories to use, the
classification of items is so strongly affected by the judges' own opinions
as to permit the use of this procedure itself as a disguised attitude test
(43).

Another approach to the construction of attitude scales is that followed by
Likert (46). Unlike Thurstone-type scales, the Likert scaling procedure does
not require the classification of items by a group of judges. Items are
selected solely on the basis of the responses of subjects to whom they are
administered in the course of developing the test. Internal consistency is
often the only criterion for item selection, although external criteria may
be employed when available.

The Likert-type scale, moreover, calls for a graded response to each
statement. The response is usually expressed in terms of the following five
categories: strongly agree (SA), agree (A), undecided (U), disagree (D), and
strongly disagree (SD). The individual statements are either clearly
favorable or clearly unfavorable. To score the scale, the alternative
responses are credited 5, 4, 3, 2, or 1, respectively, from the favorable to
the unfavorable end. For example, "strongly agree" with a favorable statement
would receive a score of 5, as would "strongly disagree" with an unfavorable
statement. The sum of the item credits represents the individual's total
score, which must be interpreted in terms of empirically established norms.
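
The crediting rule just described may be sketched in Python as follows. This
is a minimal illustration only: the function name, the item keying, and the
sample responses are hypothetical, and reversing the credit for an
unfavorable statement is written here as 6 minus the credit, which yields the
5-to-1 scoring stated in the text.

```python
# Credits for a favorable statement, from the favorable to the unfavorable end.
CREDIT = {"SA": 5, "A": 4, "U": 3, "D": 2, "SD": 1}


def likert_total(responses, favorable):
    """Sum of item credits for one respondent.

    responses: one of "SA", "A", "U", "D", "SD" per item.
    favorable: True if the corresponding statement is favorably worded,
               False if it is unfavorably worded (credit is then reversed).
    """
    if len(responses) != len(favorable):
        raise ValueError("responses and keying must have the same length")
    total = 0
    for response, is_favorable in zip(responses, favorable):
        credit = CREDIT[response]
        total += credit if is_favorable else 6 - credit
    return total


# Hypothetical four-item scale: items 1 and 3 favorable, items 2 and 4 not.
print(likert_total(["SA", "D", "U", "SD"], [True, False, True, False]))
```

The total obtained in this way has no meaning by itself; as noted above, it
must be referred to empirically established norms.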


An example of a modified Likert-type scale is the Minnesota Teacher Attitude
Inventory. Designed to assess pupil-teacher relations, this test was
developed by administering over seven hundred items to 100 teachers nominated
by their principals as superior in pupil-teacher relations and 100 nominated
as inferior. Cross validation of the resulting 150-item inventory in
different groups yielded concurrent validity coefficients of .46 to .60 with
a composite criterion derived from principal's estimate, pupils' ratings, and
evaluation by a visiting expert. Two sample items from this inventory are
shown below.

Most pupils are resourceful when left on their own.

A teacher should never acknowledge his ignorance of a topic in the presence
of his pupils.

For each statement, respondents mark SA, A, U, D, or SD. The scores assigned
to these responses are based on criterion keying, as in the Strong VIB,
rather than on arbitrary 1 to 5 weights. Since its publication, this test has
been widely used in research. For practical application in selection and
counseling, more information is needed, especially with regard to predictive
validity and the interpretation of norms from different groups.

Mention should also be made of the work of Guttman on "scalogram analysis"
and of Lazarsfeld on "latent structure analysis" (66). These techniques were
developed largely during World War II, in connection with a monumental
project on the measurement of soldier opinions. Both are essentially
procedures for determining the unidimensionality or homogeneity of items
through an analysis of the responses given by a trial group of subjects. The
approaches of Guttman and Lazarsfeld provide theoretical models, or
conceptual frameworks, for attitude measurement, which represent interesting
starting points for further research. In their present form, however, these
techniques are still limited by a number of unresolved difficulties, both
theoretical and practical. A general survey of techniques for the
construction of attitude scales can be found in Edwards (21).


OTHER MEASURES OF INTERESTS, 
ATTITUDES, AND RELATED VARIABLES 


There remain a few well-known instruments which have much in common with
interest tests, attitude scales, or both, but which do not fall clearly into
either category. Some of these tests also include measures of other
personality variables, thus overlapping the categories covered in Chapter [?]
or 18. Moreover, each of these instruments exhibits unique features in its
content, use, or construction procedures.

The Allport-Vernon-Lindzey Study of Values (1) was designed to measure the
relative prominence of six basic interests, motives, or evaluative attitudes.
Originally suggested by Spranger's Types of Men (65), the value categories
may be described as follows:

Theoretical: Characterized by a dominant interest in the discovery of truth
and by an empirical, critical, rational, "intellectual" approach.

Economic: Emphasizing useful and practical values; conforming closely to the
prevailing stereotype of the "average American business man."

Aesthetic: Placing the highest value on form and harmony; judging and
enjoying each unique experience from the standpoint of its grace, symmetry,
or fitness.

Social: Originally defined as love of people, this category has been more
narrowly limited in later revisions of the test to cover only altruism and
philanthropy.

Political: Primarily interested in personal power, influence, and renown; not
necessarily limited to the field of politics.

Religious: Mystical, concerned with the unity of all experience, and seeking
to comprehend the cosmos as a whole.

Items for the Study of Values were first formulated on the basis of the
theoretical framework provided by Spranger. The criterion for the final item
selection was internal consistency within each of the six areas.
Intercorrelations of scores on the current form reveal no substantial overlap
among any of these areas. The items are arranged in random order in the test
booklet, with no clue regarding the categories according to which they will
be scored. Each item requires the preferential rating of either two or four
alternatives falling in different value categories. Two sample items are
reproduced in Figure 115.

Through an ingenious arrangement of answer spaces, scoring is simplified and
requires no key other than simple instructions printed on a detachable page.
Total scores on the six values are plotted in the form of a profile (cf. Fig.
116). It should be noted that, owing to the item form employed in this test,
final scores reflect only relative strength in the six areas. Thus it would
be impossible to obtain high or low scores in all areas. Norms are also
provided on a college population.
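
Why such scores can reflect only relative strength may be seen from a small
Python sketch. It assumes the rating rules shown for the sample items in Fig.
115: a two-alternative item distributes 3 points (3 and 0, or 2 and 1), and a
four-alternative item distributes the ranks 4, 3, 2, and 1. The function
name, the items, and the two sets of answers are hypothetical. Whatever the
respondent prefers, the grand total over the six values is fixed, so that a
gain in one area must be offset by a loss elsewhere.

```python
VALUES = ("theoretical", "economic", "aesthetic",
          "social", "political", "religious")


def score_values(part1_items, part2_items):
    """Total points per value area for one respondent.

    Each item is a dict mapping a value area to the points given to the
    alternative representing that area.
    """
    totals = {value: 0 for value in VALUES}
    for item in part1_items:
        # Two alternatives rated 3 and 0, or 2 and 1.
        if sorted(item.values()) not in ([0, 3], [1, 2]):
            raise ValueError("two-alternative items must be rated 3-0 or 2-1")
        for value, points in item.items():
            totals[value] += points
    for item in part2_items:
        # Four alternatives ranked 4 (most attractive) to 1 (least).
        if sorted(item.values()) != [1, 2, 3, 4]:
            raise ValueError("four-alternative items must use ranks 1 to 4")
        for value, points in item.items():
            totals[value] += points
    return totals


# Two hypothetical respondents answering the same three items differently.
a = score_values(
    [{"religious": 3, "economic": 0}, {"theoretical": 2, "aesthetic": 1}],
    [{"theoretical": 4, "political": 1, "aesthetic": 3, "religious": 2}],
)
b = score_values(
    [{"religious": 0, "economic": 3}, {"theoretical": 1, "aesthetic": 2}],
    [{"theoretical": 1, "political": 4, "aesthetic": 2, "religious": 3}],
)
print(sum(a.values()), sum(b.values()))
```

Both respondents receive the same grand total, although their profiles across
the six areas differ.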


Part I. The two alternatives are rated 3 and 0, if the subject agrees with
one and disagrees with the other; if he has only a slight preference for one
over the other, they are rated 2 and 1, respectively.

Example:

If you should see the following news items with headlines of equal size in
your morning paper, which would you read more attentively? (a) PROTESTANT
LEADERS TO CONSULT ON RECONCILIATION; (b) GREAT IMPROVEMENTS IN MARKET
CONDITIONS.

Part II. The answers are rated in order of personal preference, giving 4 to
the most attractive and 1 to the least attractive alternative.

Example:

In your opinion, can a man who works in business all the week best spend
Sunday in

a. trying to educate himself by reading serious books
b. trying to win at golf, or racing
c. going to an orchestral concert
d. hearing a really good sermon

Fig. 115. Sample Items from Allport-Vernon-Lindzey Study of Values.
(Reproduced by permission of Houghton Mifflin Company.)

Validity has been checked partly by the method of contrasted groups. Profiles
of various educational and occupational samples exhibit significant
differences in the expected directions. For example, medical students
obtained their highest scores in the theoretical area, theological students
in the religious area. More extensive validation data have been gathered with
the first edition of the Study of Values, which was in use for about twenty
years, although some comparable data are also available for the revised form
(13, 19, 52, 83). Some relationship has been demonstrated between value
profiles and academic achievement, especially when relative achievement in
different fields is considered. Data are also available on the correlation of
value scores with self-ratings and associates' ratings. Other overt
behavioral indices of attitudes with which scores on the Study of Values have
been compared include newspaper reading, descriptions of one's "ideal
person," club membership, church attendance, and the like. Significant
relationships in the expected directions have likewise been reported with a
number of other tests, such as Strong VIB and Thurstone attitude scales.
Finally, some studies have shown significant changes in score following
specific types of experience, such as a period of study under different
"styles" of education.

[Figure: PROFILE OF VALUES. Scores in the High, Average, and Low ranges are
plotted for the Theoretical, Economic, Aesthetic, Social, Political, and
Religious values, with separate lines for the average male profile and the
average female profile.]

Fig. 116. Sex Differences on the Allport-Vernon-Lindzey Study of Values.
(Reproduced by permission of Houghton Mifflin Company.)

It might be noted that a Pictorial Study of Values (63) has been prepared for
subjects with linguistic or reading difficulties. Pictures for this test were
assigned to each of the six scales on the basis of their correlations with
Allport-Vernon-Lindzey scores in a sample of 100 cases. Although providing a
promising idea for test development, in its present form this test is not
ready for general use.

A test that combines measures of interests and attitudes with the assessment
of certain emotional and social characteristics is the Attitude-Interest
Analysis Test developed by Terman and Miles. Commonly known as the M-F Test,
this instrument represents the first and most extensive attempt to construct
a measure of masculinity-femininity by empirical criterion keying. Other MF
scales, such as those included in the MMPI, Strong VIB, and
Guilford-Zimmerman Temperament Survey, have subsequently been developed, but
these scales are based on fewer and more limited types of items than are
found in the Terman-Miles test. It cannot be assumed that these different
scales are interchangeable. The correlations between their scores are usually
low (18).

The development of the Terman-Miles test began with an exhaustive search of
the psychological literature for types of test content that yielded the most
pronounced sex differences. The preliminary sets of items prepared on this
basis were then administered to many hundreds of persons, ranging from
elementary school children to college students and many adult groups. The
principal criterion for item selection was the relative proportion of men and
women giving each response. Items that yielded significant sex differences
were retained, while those that failed to discriminate between men and women
were discarded. The direction of sex difference in each response determined
whether the particular response was scored as masculine or feminine. The
final test was prepared in two equivalent forms, each consisting of seven
subtests: Word Association, Inkblot Association, Information, Emotional and
Ethical Attitudes, Interests, Opinions, and Introvertive Response. It will be
noted that, although concerned predominantly with interests and attitudes,
the test also includes measures of other personality characteristics, such as
introversion-extroversion. Some of the subtests, moreover, utilize
adaptations of projective techniques and other testing procedures to be
considered in Chapters 20 and 21.
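
The logic of empirical criterion keying described above, in which a response
is retained only if men and women differ significantly in giving it, and is
keyed in the direction of the difference, can be illustrated by the following
Python sketch. The statistical test is an assumption made for the example (a
two-proportion z test with a conventional 1.96 cutoff); the text does not
state which significance test Terman and Miles applied. The function name and
the counts are likewise hypothetical.

```python
from math import sqrt


def key_response(men_yes, men_n, women_yes, women_n, z_crit=1.96):
    """Key one response as "masculine", "feminine", or None (discard).

    Compares the proportion of men and of women giving the response, using
    a pooled two-proportion z test (an assumed choice of test).
    """
    p_men = men_yes / men_n
    p_women = women_yes / women_n
    pooled = (men_yes + women_yes) / (men_n + women_n)
    se = sqrt(pooled * (1 - pooled) * (1 / men_n + 1 / women_n))
    if se == 0:
        return None  # everyone (or no one) gave the response
    z = (p_men - p_women) / se
    if abs(z) < z_crit:
        return None  # fails to discriminate between the sexes
    return "masculine" if z > 0 else "feminine"


# Hypothetical counts for three responses, 100 men and 100 women each.
print(key_response(70, 100, 40, 100))
print(key_response(50, 100, 52, 100))
print(key_response(20, 100, 45, 100))
```

A scale built in this way necessarily consists only of responses on which the
sexes differ, which bears on the interpretation of MF scores.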

In interpreting MF scores on either the Terman-Miles or other scales designed
for the same purpose, two points should be borne in mind. First, MF scores
show only the degree to which the individual's responses agree with those
most characteristic of men or women in the culture within which the test was
developed. For the Terman-Miles test, the culture is that of the United
States in the 1930's. Second, it should be noted that such tests are
deliberately designed so as to exaggerate sex differences. The behavior of
men and women has much in common. The MF tests, however, concentrate only
upon the differences. Although these tests can be used to determine the
extent to which the individual approximates the norm for his or her sex, they
do not provide a basis for establishing the amount of sex difference in
psychological traits.

As a final example of a test that cuts across some of the traditional
categories we may consider The DF Opinion Survey (31) prepared by Guilford
and his associates. Yielding separate scores in 10 "dynamic factors" (DF),
this test was designed to measure needs (as in the Edwards Personal
Preference Schedule) and broad interest areas (as in the Kuder Vocational).
The 10 factors covered by the test are based largely on Guilford's previously
cited factor analyses of interests. Guilford classifies needs, interests, and
attitudes as dynamic factors. Examples of the traits for which scores are
provided in The DF Opinion Survey include need for attention, liking for
thinking, aesthetic appreciation, adventure versus security, and cultural
conformity.


THE PLACE OF INTERESTS IN PERSONALITY THEORY 


The measurement of interests began as a relatively specific, minor, and
tangential development in the study of personality. Early interest
inventories were oriented chiefly toward the prediction of the individual's
eventual acceptance or rejection of particular job functions. From these
modest beginnings, interest tests are gradually coming to play a major role
in the formulation of personality theory. The impetus for these developments
is coming from several different sources. Factorial analyses of interest
items, as illustrated by the work of Guilford (30), have demonstrated the
interconnections of interests with other important dimensions of personality.
The previously cited DF Opinion Survey, combining interests and needs,
highlights the broadening concept of interests and the recognition of their
motivational aspects.

A number of studies have revealed significant associations between
vocational interests and other aspects of personality. Research on male
homosexuals, neurotics, and persons with other psychiatric disorders has
shown a predominance of certain characteristic vocational interest patterns
in each group (25, 33, 5[?]). Scores on the Strong VIB and the Kuder
Vocational have proved to be related to performance on other personality
tests, such as the MMPI and the Study of Values (9, 17, 28, 36). Some
investigators have provided personality descriptions of normal persons
scoring high or low on particular vocational interest scales. In a survey of
1000 male University of Minnesota freshmen, scores on certain VIB
occupational scales were found to be significantly correlated with other
measured emotional and attitudinal variables (17, Ch. 4). For example,
students scoring high in the social service or business contact clusters of
occupational keys obtained higher social adjustment scores than those scoring
high in other clusters. On economic conservatism, these same two groups
scored in opposite ways, those with high social service interests being more
liberal than those with high business contact interests (17, pp. [?]).

More detailed personality evaluations of individuals receiving high scores
on different occupational interest scales are to be found in a study of 100
Air Force officers conducted at the University of California (cf. 17, pp.
128-129). In the course of an assessment program, the officers were given a
variety of tests, including the Strong VIB, and were studied by observational
techniques. On the basis of the information thus obtained, the subjects were
described by eight clinical psychologists with respect to a set of personality
variables. Correlations of these descriptions with scores on the VIB
occupational keys revealed a number of significant relations. For illustrative
purposes, the personality characteristics associated with high scores on two
keys are summarized below.

High Scorers on Mathematician Key: Self-abasing, concerned with philosophical
problems, introspective, lacking in social poise, lacking confidence in own
ability, sympathetic, reacts poorly to stress, not persuasive in personal
contacts, not an effective leader, not ostentatious, not aggressive or socially
ascendant.


High Scorers on Real Estate Salesman Key: Self-indulgent, guileful, cynical,
opportunistic, aggressive, persuasive, ostentatious, may arouse hostility in
others, not sympathetic, not concerned with philosophical problems, not lacking
confidence in own ability, not self-abasing.


From a different angle, it is now widely recognized that the choice of an
occupation often reflects the individual's basic emotional needs and that
occupational adjustment is a major aspect of general life adjustment (7, 17,
3[?], 58). There are many different ways of dealing with interpersonal
relations and other life problems. No one way is universally better than
others. When he chooses a vocation, each individual is to some extent selecting
those adjustment techniques, life patterns, and roles most congenial to
himself. The measurement of vocational interests, and more specifically the
identification of those occupational groups whose interests and attitudes the
individual shares most closely, thus becomes a focal point in the understanding
of different personalities. Direct studies of the characteristics of persons in
different occupations have been contributing to a growing fund of factual
material for implementing this approach (cf., e.g., 58).

In line with the above viewpoint, Holland (36) has come full circle in
developing an inventory of occupational titles as a measure of personality
characteristics. In this test, the respondent merely indicates whether he likes
or dislikes each of 300 occupations. Although constructed largely in terms of
content validity and internal consistency, the resulting scales yield
significant differences between matched samples of psychiatric patients and
normal controls. Interest profiles of students in different curricula were also
consistent with expectation. The inventory provides 10 "personality" scales, as
illustrated by Physical Activity, Intellectuality, Responsibility, and
Conformity. There are also three response set scales, including Question
(number of items left blank), Infrequency (number of rare items marked), and
Acquiescence (number of items "liked"). How effective this inventory will
ultimately prove to be in either research or practice remains to be seen. But
its development illustrates current emphasis on vocational choices as clues to
personality.

From still another angle, Tyler (81) regards the study of interests as a
way of identifying the choices that the individual makes at various stages.
These choices are both a reflection of the kind of person he is and a forecast
of what he is likely to become. As each choice is made (choice of friends,
recreations, courses, jobs, and the like), the individual's subsequent
experiences are thereby channeled into certain paths. Alternative developmental
routes are eliminated at each choice point. Translating the predictive validity
of the Strong VIB into these terms, Tyler writes:

[...] signifies that the person's characteristic pattern of acceptance and
rejection of [...] varied possibilities is like the choice pattern
characteristic of persons in a particular occupation. What we should expect
then to be able to predict from such a [score] is ... the way he will make his
choices at later junctures of his life. This makes sense of the high degree of
validity Strong's recent studies have shown for the test (81, p. 78).


The measurement of interests promises to be a lively field of test
development and research in the years ahead. Having proved so far to be among
the most successful tests outside the aptitude domain, interest inventories are
now well on the way to attaining theoretical respectability. At the same time,
we can anticipate that the interpretation of interest profiles will show
further advances in depth and sophistication.

CHAPTER 20


Projective Techniques 


The chief distinguishing feature of projective techniques is to be found in 
their assignment of a relatively unstructured task, i.e., a task that permits an 
almost unlimited variety of possible responses. In order to allow free play to
the subject's imagination, only brief, general instructions are provided. For
the same reason, the test stimuli are usually vague and equivocal. The under- 
lying hypothesis is that the way in which the individual perceives and inter- 
prets the test material, or "structures" the situation, will reflect fundamental 
aspects of his psychological functioning. In other words, it is expected that 
the test materials will serve as a sort of screen upon which the subject “pro- 
jects” his characteristic ideas, attitudes, strivings, fears, conflicts, aggressions 
and the like. 

Typically, projective instruments also represent disguised testing
procedures, in so far as the subject is rarely aware of the type of
psychological interpretation that will be made of his responses. Projective
techniques are likewise characterized by a global approach to the appraisal of
personality. Attention is focused upon a composite picture of the whole
personality, rather than upon the measurement of separate traits.
Although the term "projective technique" was first applied to this type of
instrument by Frank (43) in an article appearing in 1939, such techniques had
been in use for many years prior to that date. Projective methods originated
within a clinical setting and have remained predominantly a tool for the
clinician. Some have evolved from therapeutic procedures (such as ... therapy)
employed with psychiatric patients. Most projective techniques reflect the
influence of psychoanalytic concepts. The emphasis placed upon a global or
holistic approach, moreover, will be recognized as a contribution of Gestalt
psychology. It should be noted, of course, that the specific techniques need
not be evaluated in the light of their particular theoretical slants or
historical origins. A procedure may prove to be practically useful or
empirically valid for reasons other than those initially cited to justify its
introduction.

In line with their typically global approach, projective techniques have 
been concerned not only with emotional and social characteristics, evidences 
of maladjustment, interests, attitudes, and motives, but also with certain
intellectual aspects of the individual's behavior. Examples of the latter
include "general intellectual level," originality, and characteristic methods
of attacking problems. Certain adaptations of projective techniques have been
specially designed for the measurement of attitudes (33, 103), and thus
supplement the instruments described in Chapter 19. Examples of projective
attitude tests will be cited in the appropriate sections of this chapter. Still 
other approaches to attitude measurement will be illustrated in Chapter 21. 

The array of projective techniques that have been published or described
in the literature is large and steadily growing. Once the underlying principle
is grasped, it is relatively easy to design a "new model" that may exhibit
varying degrees of resemblance to earlier instruments. Whether the new
technique represents a real improvement over existing procedures is much more
difficult to demonstrate. Few projective instruments have advanced beyond the
stage of preliminary exploration. It might be added that almost any
psychological test, for whatever purpose designed, can also serve as a
projective instrument. Intelligence tests, for example, have been employed in
this fashion by some clinicians (cf. 44).

For the present purpose, only a few outstanding examples of projective
techniques can be considered. Similarly, a critical examination of individual
instruments would be beyond the scope of the present volume. Instead, a
summary evaluation of such instruments as a group will be given in the last
section of the chapter. More extensive surveys and fuller discussions of
specific techniques are available in a number of published sources, such as Abt
and Bellak (1), Anderson and Anderson (10), Bell (...), Brower and Abt (26),
Watson (101), and Weider (102). Critiques from different angles can also be
found in articles by Cronbach (36, 37) and Eysenck (41), among others. The
Mental Measurements Yearbooks devote a separate section to projective
techniques, where critical reviews of all instruments cited in this chapter can
be found.

Projective techniques have been classified with reference to various
parameters, such as nature of the stimuli presented, method of administration,
manner of interpreting responses, and test-construction procedures. Lindzey
(64) has proposed a classification in terms of mode of response. Not only does
such a classification focus upon a characteristic that is important in its own
right, but it also shows close agreement with a composite classification based
on all major attributes of well-known projective techniques. This
classification, which will be followed in the present chapter, groups
projective techniques into five categories:


(1) associative techniques, in which the subject must respond to a stimulus by
giving the first word, image, or percept that occurs to him;

(2) construction procedures, requiring the subject to create or construct a
product, such as a story;

(3) completion tasks, such as completing sentences or stories;

(4) choice or ordering devices, calling for the rearrangement of pictures,
recording of preferences, and the like;

(5) expressive methods, such as drawing, which differ from construction
procedures in that the subject's style or method is evaluated as well as the
finished product.


ASSOCIATIVE TECHNIQUES 


Word Association. A technique that antedated the current flood of projective
tests by more than half a century is the word association test (81, Ch. 2).
Originally known as the "free association test," this technique was first
systematically described by Galton in 1879 (47). Wundt subsequently introduced
it into the psychological laboratory. In this procedure, the subject responds
to each stimulus word with the first word that comes to his mind. The early
experimental psychologists, as well as the first mental testers, saw in such
association tests a tool for the exploration of thinking processes.

The clinical application of word association methods was stimulated largely
by the psychoanalytic movement, although other psychiatrists, such as
Kraepelin, had previously investigated such techniques. Among the
psychoanalysts, Jung's contribution to the systematic development of the word
association test is most conspicuous (58). Jung selected stimulus words to
represent common "emotional complexes." The responses were analyzed with
reference to reaction time and content, the latter being classified according
to the general character of the association, such as supraordinate, modifying
adjective, sound association, and the like. Overt expressions of emotional
tension, such as laughing, flushing, and hand movements, were also noted. The
test was then readministered with the subject being instructed to try to
recall the original responses. Changes in response words and other features of
the subject's retest behavior provided further diagnostic clues.

More recently, a word association technique was developed at the Menninger
Clinic by Rapaport, Gill, and Schafer (81, Ch. 2). In its general orientation,
this adaptation reveals its kinship to the earlier Jung test. The 60-word list
contains a preponderance of words selected for their psychoanalytic
significance, many of them being associated with psychosexual conflicts.
According to its authors, the test had a dual aim: to aid in detecting
impairment of thought processes and to suggest areas of significant internal
conflicts. Results are analyzed with reference to such characteristics as
proportion of common or popular responses, reaction times, associative
disturbances, and impaired reproduction on retest.

A different approach to the word association test is illustrated by the early
work of Kent and Rosanoff (59). Designed principally as a psychiatric
screening instrument, the Kent-Rosanoff Free Association Test utilized
completely objective scoring and statistical norms. The stimulus words
consisted of 100 common, neutral words, chosen because they tend to evoke the
same associations from people in general. For example, to the word "table,"
most people respond "chair"; to "dark," they say "light." A set of frequency
tables was prepared (one for each stimulus word) showing the number of times
each response was given in a standardization sample of 1000 normal adults.

In scoring the test, the median frequency value of the responses that the
subject gave to the 100 stimulus words was employed as an "index of
commonality." Any responses not found in the normative tables were designated
as "idiosyncratic." Comparisons of psychotics with normals suggested that
psychotics give more idiosyncratic responses and obtain a lower index of
commonality than the normals. The test fell into disuse, however, with the
gradual realization that the frequency of different responses also varies
widely with age, socioeconomic and educational level, regional and cultural
background, and other factors. The task of developing adequate normative tables
would thus be prohibitive, unless the test were to be used within a narrowly
defined population.
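
The scoring rule just described can be illustrated with a minimal Python sketch. It is not the published Kent-Rosanoff procedure: the function name, the three-word norms table, and every frequency value below are hypothetical and chosen only for illustration. The sketch also assumes that a response absent from the normative table enters the median with a frequency of zero, a point the description above leaves open.

```python
from statistics import median


def index_of_commonality(responses, norms):
    """Return (median normative frequency, stimuli with idiosyncratic responses).

    responses: dict mapping stimulus word -> the subject's response word.
    norms: dict mapping stimulus word -> {response word: frequency in the
           standardization sample}. All values used here are invented.
    """
    frequencies = []
    idiosyncratic = []
    for stimulus, response in responses.items():
        # Look up how often the normative sample gave this response.
        freq = norms.get(stimulus, {}).get(response, 0)
        if freq == 0:
            # Not in the normative table: designated "idiosyncratic".
            idiosyncratic.append(stimulus)
        # Assumption: idiosyncratic responses count as frequency 0.
        frequencies.append(freq)
    return median(frequencies), idiosyncratic


# Hypothetical normative frequencies (per 1000 adults) for three stimulus words.
norms = {
    "table": {"chair": 267, "wood": 76},
    "dark": {"light": 427, "night": 221},
    "man": {"woman": 394, "boy": 44},
}
responses = {"table": "chair", "dark": "night", "man": "umbrella"}

index, odd = index_of_commonality(responses, norms)
print(index, odd)
```

Because the index depends entirely on the normative table, substituting norms drawn from a different age, educational, or regional group would change the same subject's score, which is the difficulty noted above.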

A number of investigators have been exploring the possibilities of utilizing
homographs as stimuli for free association, with promising results. Homographs
are words that are spelled identically but have two or more distinctly
different meanings. For example, "ring" may refer to an article of jewelry or
to [...] interpersonal relations. By means of criterion keying, experimental
versions of homograph tests have been developed for measuring such traits as
masculinity-femininity (48), leadership (35, 49), and emotional dependence upon
one's social environment (98).

The free association method has also been employed in the appraisal of
interests and attitudes. A free association interest test was developed by
Wyman (104) for analyzing the interests of gifted children. This investigation
was part of the well-known study of gifted children conducted under the
direction of Terman at Stanford University. In the Wyman test, the responses
were scored with reference to "intellectual interest," "social interest," and
"activity interest."

Another adaptation of the word association technique for the study of atti- 
tudes is illustrated by the work of Murray and Morgan (77). Their procedure 
was to have the examiner read a list of 48 words—such as "father," "Com- 
munism,” and "religion" —to each of which the subject was to respond by 
giving “the most descriptive adjectives" he could think of. The subject was 
led to believe that the test was designed to measure the range of his vocabu- 
lary. Actually, however, the responses were analyzed with respect to the ratio 
of appreciative to depreciative adjectives. 
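
As a rough illustration of such a ratio score, the following Python sketch counts appreciative and depreciative adjectives among the responses to one stimulus word. The word lists, the sample responses, and the function name are hypothetical; the original classification of adjectives was made by judges and is not reproduced here. The sketch returns no score when no depreciative adjective occurs, since the ratio is then undefined.

```python
def appreciative_ratio(adjectives, appreciative, depreciative):
    """Ratio of appreciative to depreciative adjectives, or None if undefined."""
    words = [a.lower() for a in adjectives]
    n_appreciative = sum(1 for w in words if w in appreciative)
    n_depreciative = sum(1 for w in words if w in depreciative)
    if n_depreciative == 0:
        return None  # no depreciative adjectives: the ratio cannot be formed
    return n_appreciative / n_depreciative


# Hypothetical classification lists and responses to the word "father".
appreciative = {"kind", "wise", "generous"}
depreciative = {"cold", "harsh", "strict"}
given = ["Kind", "wise", "generous", "cold"]

ratio = appreciative_ratio(given, appreciative, depreciative)
print(ratio)
```

Adjectives appearing in neither list are simply ignored in this sketch, which is one of several possible conventions.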

Mention may likewise be made of the use of the word association technique
as a "lie detector." This application was also initiated by Jung, and has
subsequently been subjected to extensive research, both in the laboratory and
in practical situations (cf. 31, Ch. 9; 32, Ch. 11). The rationale offered to
justify the employment of word association in the detection of lying or guilt
is similar to that which underlies its utilization in uncovering areas of
emotional conflict. Content analysis, reaction time, and response disturbances
have all been explored as indices of lying or guilt. Frequently, physiological
measures of emotional excitement are obtained concurrently with the verbal
responses. The word lists chosen for lie detection purposes are usually
custom-made to cover distinctive features of the particular crime or other
situation under investigation. Whether word association has any practical value
as a lie detector is still a moot point. Its effectiveness certainly varies
widely with the specific circumstances under which it is used.


The Rorschach Inkblots. The best-known and most widely discussed projective
technique is undoubtedly the Rorschach. Developed by the Swiss psychiatrist,
Hermann Rorschach, this technique was first described in 1921 (84). Although
standardized series of inkblots had previously been utilized by psychologists
in studies of imagination and other functions, Rorschach was the first to apply
inkblots to the diagnostic investigation of the personality as a whole. In the
development of this technique, Rorschach experimented with a large number of
inkblots which he administered to different psychiatric groups. By a process of
trial and error, those response characteristics that differentiated between the
various psychiatric syndromes were gradually incorporated into the scoring
system. The scoring procedures were further sharpened by supplementary testing
of mental defectives, normals, artists, scholars, and other persons of known
characteristics. Rorschach's methodology thus represented an early, informal,
and relatively subjective application of criterion keying.

The Rorschach utilizes 10 cards, on each of which is printed a bilaterally
symmetrical inkblot similar to that illustrated in Figure 117. Five of the
blots are executed in shades of gray and black only; two contain additional
touches of bright red; and the remaining three combine several pastel shades.
As the subject is shown each inkblot, he is asked to tell what he sees, that
is, what the blot could represent. Besides keeping a verbatim record of the
subject's responses to each card, the examiner notes time of responses,
position or positions in which cards are held, spontaneous remarks, emotional
expressions, and other incidental behavior of the subject during the test
session. Following the presentation of all 10 cards, the examiner questions the
subject systematically regarding the parts and aspects of each blot to which
the associations were given. During this inquiry, the subject also has an
opportunity to clarify and elaborate his earlier responses.

Fig. 117. An Inkblot of the Type Employed in the Rorschach Technique.

The 10 cards* are the only part of the Rorschach actually shared by the
various techniques bearing that name. Several clinicians have developed their
own Rorschach "systems," the major differences among them being in the scoring
and interpretation of responses (17, 18, 55, 60, 61, 79, 90). Some of these
systems are characterized by exaggerated claims regarding the applicability and
effectiveness of the technique, or by naive metaphorical interpretation of
responses, or by procedures so highly subjective as to render communication
and replication difficult. Among the most widely used Rorschach techniques in
America is that developed by Beck (17), which also adheres most closely to the
original procedure followed by Rorschach.

* The cards, or plates, are printed by Hans Huber, in Berne, Switzerland. They
can be obtained in America from Grune & Stratton or from a number of test
distributors, such as Stoelting and The Psychological Corporation.

The most common scoring categories employed with the Rorschach include 
location, determinants, and content. Location refers to the part of the blot 
with which the subject associates each response. Does he use the whole blot, 
a common detail, an unusual detail, white space, or some combination of 
these areas? The determinants of the response include form, color, shading,
and "movement." Although there is of course no movement in the blot itself,
the subject's perception of the blot as a representation of a moving object is
scored in this category. Further differentiations are made within these
categories. For example, human movement, animal movement, and abstract or
inanimate movement are separately scored. Similarly, shading may be
perceived by the subject as representing depth, texture, hazy forms such as
clouds, or achromatic reproductions of colors as in a photograph. 

The treatment of content varies from one scoring system to another,
although certain major categories are regularly employed. Chief among these
are human figures, human details (or parts of human figures), animal figures,
animal details, and anatomical diagrams. Other broad scoring categories include
inanimate objects, plants, maps, clouds, blood, x-rays, sexual objects, and
symbols. A popularity score is often found on the basis of the relative
frequency of different responses among people in general. For each of the 10
cards, certain responses are scored as popular because of their common
occurrence.

Further analysis of Rorschach responses is based upon the relative number
of responses falling into the various categories, as well as upon certain
ratios and interrelations among different categories. Examples of the sort of
qualitative interpretations that have commonly been utilized with Rorschach
responses include the association of "whole" responses with conceptual
thinking, of "color" responses with emotionality, and of "human movement"
responses with imagination and fantasy life. In the usual application of the
Rorschach, major emphasis is placed on the final "global" description of the
individual, in which the clinician integrates the results from different parts
of the protocol and takes into account the interrelations of different scores
and indices. In actual practice, information derived from outside sources, such
as other tests, interviews, and case history records, is also utilized in
preparing such global descriptions.

Although the Rorschach is considered to be applicable from the preschool 
to the adult level, its normative data were derived largely from adult groups. 
This limitation also characterizes the general fund of clinical experience
accumulated through the use of the Rorschach and employed in the qualitative
interpretation of protocols. In the effort to extend the empirical framework 
for Rorschach interpretation to other age groups, Ames and her co-workers 
at the Gesell Institute of Child Development at Yale collected and published 
Rorschach norms on children between the ages of 2 and 10 years (2), on 
adolescents between the ages of 10 and 16 (4), and on older persons from 
the age of 70 up (3). 

Some of the underlying assumptions of traditional Rorschach scoring have
been called into question by a growing body of research findings. Comparative
studies with standard and achromatic series of Rorschach cards, for example,
have demonstrated that color itself has no effect on most of the response
characteristics customarily attributed to it (16). There is also evidence that
verbal aptitude influences several Rorschach scores commonly interpreted as
indicators of personality traits (67, 87). In one study (87), an analysis of
the verbal complexity of individual Rorschach responses given by 100 persons
revealed that "movement" responses tended to be longer and linguistically more
complex than "form" responses. It was also found that verbal complexity of
Rorschach responses was highly correlated with the subjects' scores on a verbal
aptitude test, as well as with age and educational level.


A major complicating factor in the interpretation of Rorschach scores is
the total number of responses, known as response productivity or R (42).
Because of large individual differences in R, the practice of considering the
absolute number of responses in various categories is obviously misleading.
For example, the number of "whole" responses that can be perceived in the blots
is quite limited, while associations to isolated details of the blots may
continue indefinitely. R has also been found to vary with such factors as
intellectual level and amount of education. Even more disturbing is the finding
that R varies significantly from one examiner to another (cf. 15, 50). All of
these results suggest that R, which may itself be a major determinant of many
of the common Rorschach scores, is influenced by factors quite extraneous to
the basic personality variables allegedly measured by the Rorschach. Findings
such as these strike at the very foundation upon which the entire elaborate
superstructure of Rorschach scoring is supported.
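
The arithmetic behind this difficulty can be shown with a small Python sketch. The two protocols and their location counts are invented for illustration and do not come from any study; the function name is likewise hypothetical. The sketch assumes that each response is scored in exactly one location category, so that R equals the sum of the counts.

```python
def category_percentages(counts):
    """Return (R, percentage of responses per category) for one protocol."""
    r = sum(counts.values())  # response productivity, R
    if r == 0:
        raise ValueError("protocol contains no responses")
    return r, {category: 100 * n / r for category, n in counts.items()}


# Two hypothetical protocols that differ mainly in productivity.
protocol_a = {"whole": 6, "common_detail": 10, "unusual_detail": 4}
protocol_b = {"whole": 8, "common_detail": 30, "unusual_detail": 12}

r_a, pct_a = category_percentages(protocol_a)
r_b, pct_b = category_percentages(protocol_b)

# Subject B gives more "whole" responses in absolute terms (8 against 6),
# yet a smaller share of his responses are "whole" because his R is larger.
print(r_a, pct_a["whole"])
print(r_b, pct_b["whole"])
```

Converting counts to percentages removes the direct dependence on R but not the indirect one: in this invented case the lower "whole" percentage for the second subject reflects little more than his greater productivity, so neither form of the score can be read without reference to R.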

Nor can any encouragement be found in empirical studies of Rorschach
validity. Despite a bibliography of over two thousand publications on the
Rorschach, the vast majority of interpretive relationships that form the basis
of Rorschach scoring have never been empirically validated. The number of
published studies that have failed to demonstrate a significant relation
between Rorschach scores, combinations of scores, or global evaluations and
relevant criteria is truly impressive. The Rorschach was found to have little
or no predictive or concurrent validity when checked against such criteria as
psychiatric diagnosis, response to psychotherapy, various determinations of
personality or intellectual traits in normal persons, success or failure in a
wide variety of occupations in which personality qualities play an important
part, and presence of various conflicts, fears, attitudes, or fantasies
independently identified in patients. Those studies that appear to provide
positive results have been shown to contain serious methodological defects.
Further attention to some of these methodological problems will be given in the
concluding section of this chapter. For more detailed evaluation of the
Rorschach technique itself, the reader is urged to examine the unusually
thorough discussions of this instrument by four independent reviewers in The
Fifth Mental Measurements Yearbook.


Partly as a result of the predominantly negative findings of validation
studies, there has been a shift of emphasis from traditional perceptual or
formal scoring to content analysis of Rorschach responses (105, 106). Rorschach
himself and most of his followers have relied most heavily on the perceptual
bases of the subject's associations to the inkblots, as illustrated by
location, color, form, shading, etc. Implicit in such an approach are the
assumptions that the subject's responses to the Rorschach cards are indicative
of his usual perceptual responses and that personality is reflected in these
perceptual responses. In contrast to this approach, content analysis is
concerned with what the subject perceives in the blots. Such content can be
treated in much the same way as the statements made by a subject during a
clinical interview. [...] In line with this resemblance to the clinical
interview, attention is also being given to the role of interpersonal factors
in Rorschach responses. There is considerable evidence to show that the social
interaction of a particular subject with a particular examiner influences many
aspects of the subject's responses and of the examiner's interpretation of such
responses (72, 88, 90). In the light of all available data, it would seem best
at this stage to regard the Rorschach as an interview aid for the skilled
clinician, rather than as a test. This is not to deny the possibility that a
valid, objectively scorable inkblot test of personality characteristics may be
developed in the future. Experimentation along these lines is in progress (cf.,
e.g., 57), and some of it may prove fruitful.


CONSTRUCTION PROCEDURES 


In contrast to the associative techniques discussed in the preceding section,
the type of projective instrument now under consideration requires more
complex and somewhat more controlled intellectual activities on the part of
the subject. Thus in telling or writing a story that fits a given picture, the
individual is bound by certain implicit conventions regarding grammatical
expression, logical organization, unity of content, congruence with all
elements in the picture, and the like. The instructions, too, frequently focus
on quality of production, by introducing the task as a test of imagination or
intelligence. Interpretation of responses is typically based on content
analysis of a rather qualitative nature. With the gradual realization that
content analysis of projective techniques may be more fruitful than formal
scoring, there has been an increasing tendency on the part of clinicians to
turn to these story construction techniques, which provide more opportunities
for content analysis than does the Rorschach.

Thematic Apperception Test. The original and basic test in the present
category is the Thematic Apperception Test (TAT) developed by Murray and his
staff at the Harvard Psychological Clinic (13; 53; 75; 76; 102, pp. 636-649).
Not only has this test been used much more widely than other story construction
techniques, but it has also served as a model for the development of later
instruments in this class. The TAT materials consist of 19 cards containing
vague pictures in black and white and one blank card. The subject is asked to
make up a story to fit each picture, telling what led up to the event shown in
the picture, describing what is happening at the moment and what the characters
are feeling and thinking, and giving the outcome. In the case of the blank
card, the subject is instructed to imagine some picture on the card, describe
it, and then tell a story about it. The original procedure outlined by Murray
(75) requires two one-hour sessions, 10 cards being employed during each
session. The cards reserved for the second session were deliberately chosen to
be more unusual, dramatic, and bizarre, and the accompanying instructions urge
the subject to give free play to his imagination. Four overlapping sets of 20
cards are available: for boys, girls, men over 14, and women over 14. Most
clinicians use abridged sets of specially selected cards, seldom giving more
than 10 cards to a single respondent. A card from the second set is shown in
Figure 118.


Fig. 118. One of the Pictures Used in the Thematic Apperception Test.
(Reproduced by permission of Harvard University Press.)

In interpreting TAT stories, the examiner first determines who is the
"hero," the character of either sex with whom the subject has presumably
identified himself. The content of the stories is then analyzed principally in
reference to Murray's list of "needs" and "press." Several of the proposed
needs were described in the preceding chapter, in connection with the Edwards
Personal Preference Schedule. Examples include achievement, aggression,
nurturance, and sex. Press refers to environmental forces that may facilitate
or interfere with the satisfaction of needs. Being attacked or criticized by
another person, receiving affection, being comforted, and exposure to physical
danger as in a shipwreck are illustrations of press. In assessing the
importance or strength of a particular need or press for the individual,
special attention is given to the intensity, duration, and frequency of its
occurrence in different stories, as well as to the "uniqueness" of its
association with a given picture. The assumption is made that unusual
material, which departs from the common responses to each picture, is more
likely to have significance for the individual.

A fair amount of normative information has been published regarding the
most frequent response characteristics for each card, including the way each
card is perceived, the themes developed, the roles ascribed to the characters,
emotional tones expressed, speed of responses, length of stories, and the like
(cf. 13; 53; 101, Ch. 16). Although these normative data provide a general
framework for interpreting individual responses, most clinicians rely heavily
on "subjective norms" built up through their own experience with the test. A
number of quantitative scoring schemes and rating scales have been developed
that yield good scorer reliability. Since their application is rather time
consuming, however, such scoring procedures are seldom used in clinical
practice. Although typically given as an individual oral test in the clinical
situation, the TAT may also be administered in writing and as a group test.
There is some evidence suggesting that under the latter conditions
productivity of meaningful material may be facilitated.

The TAT has been used extensively in personality research. Several
investigations have been concerned with the assumptions that underlie TAT
interpretations, such as self-identification with the hero and personal
significance of uncommon responses (63). Although they cannot establish
concurrent or predictive validity of the TAT for specific uses, such studies
contribute to the construct validation of TAT interpretations. A basic
assumption that the TAT shares with other projective techniques is that
present motivational and emotional states affect responses to an unstructured
test situation. A considerable body of experimental data is available to show
that such conditions as hunger, sleep deprivation, social frustration, and the
experience of failure in a preceding test situation affect TAT responses
(cf. 13). While supporting the projective hypothesis, the sensitivity of the
TAT to such temporary conditions may complicate the detection of enduring
individual traits.

Some of the research on the effect of personality characteristics on TAT
responses has dealt with the concurrent validity of the TAT in differentiating
various clinical groups. A number of studies employing rather sophisticated 
experimental designs have been concerned with the detection of aggressive 
tendencies. In interpreting the results of such investigations, it has been
repeatedly pointed out that the hypothesized relation between aggression in
fantasy, as revealed in the TAT, and aggression in overt behavior is not a
simple one. Depending upon other concomitant personality characteristics,
high aggression in fantasy may be associated with either high or low overt
aggression. There is some evidence suggesting that, if strong aggressive
tendencies are accompanied by high anxiety or fear of punishment, expressions
of aggression will tend to be high in fantasy and low in overt behavior; when
anxiety and fear of punishment are low, high fantasy aggression is associated
with high overt aggression (78, 80).

Lack of significant correlation between expressions of aggression in TAT
stories and in overt behavior in a random sample of cases is thus consistent
with expectation, since the relation may be positive in some individuals and
negative in others. Obviously, however, such a lack of correlation is also
consistent with the hypothesis that the test has no validity at all in
detecting aggressive tendencies. What is needed, of course, is more studies
using complex experimental designs that permit an analysis of the conditions
under which each assumption is applicable. With reference to different needs,
it has been suggested by Murray and others that whether the correlation
between overt and fantasy expression is positive or negative may depend in
part on whether the satisfaction of a particular need is encouraged or
inhibited in a given culture. All of these findings again highlight the fact
that personality trait scores should not be interpreted without reference to
other traits or concomitant circumstances.

It is apparent that the TAT can provide rich material for personality
research or for qualitative interpretation by an experienced clinician. But
any attempts to use it as an objective test in its present form could yield
very misleading results. Also relevant is the finding that, like the
Rorschach, the TAT has proved susceptible to examiner and situational
variables (72, 100). The interpersonal relation of examiner and subject
influences TAT responses, as it influences the results of any interviewing
technique.
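
The point made above about fantasy and overt aggression, that a relation which is positive in some individuals and negative in others may yield no correlation in a mixed sample, can be illustrated with a small numerical sketch. The scores below are invented purely for illustration and are not drawn from any of the cited studies; the group labels (low and high anxiety) simply follow the moderating variable mentioned in the text, and the perfect within-group relations are an exaggeration chosen to make the arithmetic transparent.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical fantasy-aggression scores, identical in both groups.
fantasy = [1, 2, 3, 4, 5]

# Hypothetical overt-aggression scores.
# Low-anxiety group: overt aggression rises with fantasy aggression.
overt_low_anxiety = [1, 2, 3, 4, 5]
# High-anxiety group: overt aggression falls as fantasy aggression rises.
overt_high_anxiety = [5, 4, 3, 2, 1]

r_low = pearson_r(fantasy, overt_low_anxiety)
r_high = pearson_r(fantasy, overt_high_anxiety)
# Pooling both groups, as in an undifferentiated sample of cases.
r_pooled = pearson_r(fantasy + fantasy, overt_low_anxiety + overt_high_anxiety)

print(f"low anxiety:  r = {r_low:+.2f}")
print(f"high anxiety: r = {r_high:+.2f}")
print(f"pooled:       r = {r_pooled:+.2f}")
```

In this contrived case the two within-group correlations are equal and opposite, so the pooled correlation vanishes. As the text notes, a zero correlation of this kind cannot by itself distinguish a moderated relation from a test with no validity; only a design that separates the groups can do so.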

[...] problems and toward various minority groups (cf. 33, p. 16). In one adaptation,
for example, ambiguous pictures portraying labor situations were intermin- 
gled with the usual TAT cards. Other investigators have designed special sets 
of cards showing intergroup contacts and other relevant scenes. 

Some TAT adaptations have focused on the intensive measurement of a 
single need or drive, such as sex or aggression (13). Of special interest is the 
extensive research on the achievement motive conducted by McClelland and 
his associates (70). To measure the individual's need for achievement, Mc- 
Clelland selected four pictures, two of which were taken from the TAT. The 
cards portray men working at a machine, a boy at a desk with a book, a
father and son picture, and a boy who is apparently daydreaming. Detailed
scoring schemas have been developed for scoring the resulting stories with
regard to expressions of the achievement drive (13, pp. 179-204; 70).

One revision of the TAT was prepared for use with Negroes, since it was
found that Negro respondents were often unable to identify sufficiently with
the characters portrayed on the original cards (97). In the revised form, the
original pictures were utilized, but Negro figures were substituted for white
persons on all cards except a few whose nature made the change unnecessary.
Several subsequent studies have shown, however, that neither total number of
words nor number of ideas in stories by Negroes differed significantly between
the original and Negro forms. Since Negroes living in the American culture are
unaccustomed to seeing pictures of Negroes, the use of such pictures tends to
focus attention on racial problems. The test may thus be useful in exploring
racial attitudes and stereotypes among both Negroes and whites, but its
advantages for clinical evaluation of individual Negroes have been questioned.
Other variants of the TAT have been developed for use in different cultures.
Such variants usually substitute culturally more appropriate situations,
besides modifying the appearance and dress of the characters.

Although the TAT is said to be applicable to children as young as 4 years,
other forms for young subjects have been prepared, such as the Symonds
Picture-Story Test for adolescents, the Children's Apperception Test (CAT),
and the Michigan Picture Test. In the Symonds test (93, 94), the [...] scenes
representing situations of concern to [...] teen-age boys or girls. The CAT
[...] situations, in the characteristic [...]. The pictures are designed to
[...] of feeding and other oral [...], aggression, toilet-training, and other
childhood experiences. Contrary to the author's assumption, several studies
with children from the first grade up have found either no significant
difference or, more often, greater productivity with human than with animal
pictures (12, 22, 46, 62).

The Michigan Picture Test (11) was developed in the course of an
investigation of the emotional reactions of children between the ages of 8
and 14, conducted by the Michigan Department of Mental Health. The test
consists of 16 TAT-like pictures, chosen to represent intrafamilial conflicts,
feelings of personal inadequacy, sexual difficulties, and other emotional
problems. Scores are based on seven psychological needs, four of which were
found useful in discriminating between groups of well-adjusted and poorly
adjusted children. Scoring is highly objective, interscorer reliability for
all needs and all grade levels averaging .98. One of the scores computed is a
"tension index," showing the frequency of verbalized expressions of unresolved
conflict. The authors report expectancy ratios indicating the probability that
a child with a given tension index will fall into the well-adjusted or poorly
adjusted category. In comparison with other projective devices, the Michigan
test is unusual in the quality of its test-construction procedures and of the
accompanying manual. Although promising results are reported from a few
studies, it is too early to determine how useful the test will prove in
practice.
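
The notion of an expectancy ratio mentioned above can be made concrete with a brief sketch. The counts, the three tension-index bands, and the function name below are all hypothetical; they are not taken from the Michigan Picture Test manual. The sketch merely shows how, given a tally of well-adjusted and poorly adjusted children at each level of an index, one could compute the proportion falling into each category.

```python
def expectancy_table(counts):
    """Convert per-band counts into proportions.

    counts maps a score band to (n_well_adjusted, n_poorly_adjusted).
    Returns a dict mapping each band to (p_well_adjusted, p_poorly_adjusted).
    """
    table = {}
    for band, (n_well, n_poor) in counts.items():
        total = n_well + n_poor
        if total == 0:
            raise ValueError(f"no cases in band {band!r}")
        table[band] = (n_well / total, n_poor / total)
    return table


# Hypothetical validation sample: children tallied by tension-index band.
hypothetical_counts = {
    "low": (45, 5),
    "middle": (30, 20),
    "high": (10, 40),
}

table = expectancy_table(hypothetical_counts)
for band, (p_well, p_poor) in table.items():
    print(f"{band:>6}: well-adjusted {p_well:.2f}, poorly adjusted {p_poor:.2f}")
```

Proportions of this kind describe the sample on which they were computed; whether they hold for new groups of children is exactly the question of practical usefulness that the text leaves open.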

Mention may also be made of several available story-telling techniques 
utilizing auditory stimuli. Such tests should prove useful in testing the blind 
or persons with defective vision, although they are not limited to these
purposes. In one such test (14), 10 sets of sound situations are presented on
records. A wide variety of sounds are included, such as typewriter, dialogue,
foghorn, wind, explosions, and train crash. After listening to each set, the
subject makes up a story incorporating as many of the sounds as he can. As
in the TAT, the story should tell what happened, what led up to the sounds,
and what was the outcome.


The Blacky Pictures. Originally devised as a means of investigating certain
psychoanalytic hypotheses, The Blacky Pictures (23) concentrate on the
assessment of various areas of psychosexual development. The materials consist
of 10 cards with cartoon-like drawings. These drawings concern a dog named
Blacky, who could be of either sex, as well as his mother, father, and a
sibling who could also be of either sex. The test was designed for adults, but
is described as being suitable also for children. The procedure is similar to
that of the TAT, in that the subject tells a story about each picture. As each
card is presented, however, the examiner adds a preliminary statement that
structures the situation a little more than in the usual projective test.
Following the completion of each story, the subject is asked a set of
standardized questions [...]

Although an ingenious research tool, this test can have little practical value
until more empirical information is obtained as to what it measures.

Make A Picture Story. Still another variation of the storytelling procedure
is illustrated by the Make A Picture Story (MAPS), developed by Shneidman
(91). Like the TAT, this test is presented to the subject as a measure of
his imaginative and creative abilities. In this case, however, the subject
first creates his own dramatic situations by choosing figures to go with a
given background, and then he develops a story around the scene. The test
provides 22 pictorial backgrounds, ranging from such highly structured ones as
a living-room or a bathroom to such ambiguous ones as a dream, a stage, or a
cave-like opening. One completely blank background is also included. There
are 67 cut-out cardboard figures, including 19 male adults, 11 female adults,
2 adults of indeterminate sex, 12 children, 10 minority group figures (such as
Negroes, Jews, Orientals), 6 legendary or fictitious characters (such as Santa
Claus), 2 animals (dog and snake), and 5 silhouettes and figures with blank
faces. Most of the figures are fully clothed, but a few are partly clothed or
nude.

Figure 119 shows the assortment of characters furnished with the MAPS
test, as well as one of the backgrounds, a forest scene, against which four
figures have been arranged on wooden stands. In administering the test, the
examiner presents the backgrounds one at a time, asking the subject to select
one or more figures, place them against the background, and "tell a story
about who the characters are, what they are doing and thinking and how they
feel, and how the whole thing turns out." Usually about half of the
backgrounds are employed, the particular choice varying somewhat with the
nature of the specific case. At the end, the subject may be asked to choose
his own background from those that are left.

Fig. 119. Materials for Use with MAPS (Make A Picture Story) Technique.
(Courtesy Psychological Corporation.)

A formal scoring system has been worked out that takes into account
which figures are chosen, how many are used, where they are placed, how
they are handled by the subject, and what relationships they are said to bear
to each other. Since the chief aim of this test is the investigation of the
"psychosocial aspects of fantasy production," the scoring categories relate
principally to interpersonal relationships. Scoring elements were selected by
criterion keying against a schizophrenic and a normal sample. Available data,
however, are insufficient to permit an evaluation of the concurrent validity
of the proposed scoring key. It is also possible, of course, to submit the
content of the stories to a thematic analysis of the type employed with the
TAT.


COMPLETION TASKS 


In completion tasks, the subject may be required to complete sentences,
stories, arguments, or conversations. All available projective completion
tests utilize verbal material, although some combine pictorial and verbal
stimuli. The subject's responses, however, are always verbal, although they
may be reported orally or in writing. Completion tests lend themselves to both
group and individual administration. Since they provide opportunities for
content analysis as well as for formal scoring, they are being employed
increasingly for a variety of purposes.

Sentence Completion. Unlike the incomplete sentences found in measures
of verbal aptitude, those utilized in projective tests permit highly varied
completions. Generally, only the opening words are given, the subject being
required to write the ending. A few typical examples are shown below:

I feel ...
What annoys me ...
My mind ...
If I had my way ...
Women ...

As in the development of adjustment inventories, the construction of
projective sentence completion tests has been characterized by extensive
borrowing of items. It is therefore difficult to trace the original authorship
of items or sets of items. Several current versions, moreover, have many
common items.

As a typical illustration, we may consider the Rotter Incomplete Sentences
Blank (86). This test comprises 40 incomplete sentences, or "stems." The
directions to the subject read: "Complete these sentences to express your real
feelings. Try to do every one. Be sure to make a complete sentence." Each
completion is rated on a seven-point scale according to the degree of
adjustment or maladjustment indicated. Illustrative completions corresponding
to each rating are given in the manual. With the aid of these specimen
responses, fairly objective scoring is possible. The sum of the individual
ratings provides a total adjustment score that can be used for screening
purposes. The response content can also be examined clinically for more
specific diagnostic clues. Validation studies have yielded some promising
results. The manual gives a well-balanced, conservative evaluation of the
strengths and weaknesses of the test, thereby providing a welcome contrast to
the exaggerated claims made for most projective instruments.
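
The scoring arithmetic described above for the Rotter blank, 40 completions each rated on a seven-point scale and summed to a total score, can be sketched as follows. Several details are assumptions made only for illustration and are not stated in the text: the ratings are coded 0 through 6, higher ratings are taken to indicate greater maladjustment, and the screening cutoff is an arbitrary hypothetical value. The ratings themselves would come from a trained scorer consulting the specimen completions in the manual; that judgment is not modeled here.

```python
N_ITEMS = 40                   # number of stems, as given in the text
RATING_MIN, RATING_MAX = 0, 6  # assumed coding of the seven-point scale
HYPOTHETICAL_CUTOFF = 130      # illustrative screening cutoff, not from the manual


def total_adjustment_score(ratings):
    """Sum the per-item ratings after checking count and range."""
    if len(ratings) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} ratings, got {len(ratings)}")
    for rating in ratings:
        if not RATING_MIN <= rating <= RATING_MAX:
            raise ValueError(f"rating {rating} outside {RATING_MIN}-{RATING_MAX}")
    return sum(ratings)


# Hypothetical protocol: 30 completions rated 3, six rated 5, four rated 1.
ratings = [3] * 30 + [5] * 6 + [1] * 4

score = total_adjustment_score(ratings)
flagged_for_follow_up = score >= HYPOTHETICAL_CUTOFF

print("total score:", score)
print("flagged for follow-up:", flagged_for_follow_up)
```

A single summed score of this kind serves only the screening purpose mentioned in the text; the clinical examination of response content is a separate, qualitative step.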

[...] test include family, opposite sex, social and friendship relations, vocation,
religious and moral beliefs, and health. A quantitative scoring system is
provided, but the empirical data on which it was based are very inadequate.
Responses can also be subjected to qualitative content analysis.

Argument Completion. Murray and Morgan (77) prepared an argument
completion test as a means of exploring attitudes. The subject is given l 
cards, on each of which is printed a brief description of the beginning of an 
argument between two young men. In each case, the subject is to continue 
the argument to its termination, employing a realistic dialogue. The examinee 
is given the impression that his powers of argumentation are being tested. It 
is expected, however, that in the process of carrying out the task he will re- 
veal which side of the argument he favors. 

Rosenzweig Picture-Frustration Study. Combining pictorial and verbal
material, the Rosenzweig Picture-Frustration Study (P-F) was designed in
terms of the author's theory of frustration and aggression (85). This test is
available in a form for children (4 to 13 years) and a form for adults (14
years and older). Each form comprises a series of cartoon-like drawings
depicting two principal characters. One of these persons is involved in a
mildly frustrating situation of common occurrence; the other is saying
something that either occasions the frustration or calls attention to the
frustrating circumstances. The subject is instructed to write in the blank
caption box what the frustrated person would answer. He is urged to give the
very first reply that comes to his mind. The frustrating situations are of two
types: (a) "ego-blocking," in which some obstruction, personal or impersonal,
impedes, disappoints, deprives, or otherwise thwarts the individual directly;
and (b) "superego-blocking," in which the individual is insulted, accused, or
otherwise incriminated by another person. An item from the adult form is
reproduced in Figure 120.

[...] Group Conformity Rating (GCR), showing the subject's tendency to give
responses that agree with the modal responses of the standardization sample,
may also be obtained.


[Cartoon caption: "I'm very sorry we splashed your clothing just now though
we tried hard to avoid the puddle."]

Fig. 120. Illustrative Item from Rosenzweig P-F Study. (Reproduced by
permission of Saul Rosenzweig.)

[...] also been made to gather norms and to check its reliability and
validity. In its present form, however, its value as an objective instrument
has not been established. Its use must therefore be circumscribed by the same
cautions applying to other projective techniques.

Like a number of other projective instruments, the P-F Study has been
adapted by other investigators for research on attitudes (cf. 33). The
general approach of the P-F Study has been followed in specially prepared
tests for studying attitudes toward minority groups (27) and opinions on
prevention of war (45).


CHOICE OR ORDERING DEVICES 


Tests in this category require the subject to choose the items or
arrangements that best fit a specified criterion such as meaningfulness or
attractiveness. Such tests necessarily present a more highly structured
stimulus and require simpler responses from the subject than most other
projective techniques. As a result, objective, quantitative scoring systems
are applicable, but they are likely to be laborious owing to the many possible
combinations of responses. Although the two examples cited in this section are
quite dissimilar in a number of ways, both may be regarded as falling between
projective tests and the "objective" personality tests to be discussed in the
next chapter. They are also alike in that both utilize pictorial material.

Szondi Test. In the Szondi Test (40, 95, 96) the subject is shown 48
photographs of psychiatric patients of both sexes, grouped into sets of eight.
Each set contains pictures representing the following diagnostic categories: a
homosexual, a sadistic murderer, an epileptic, a hysteric, a catatonic
schizophrenic, a paranoid schizophrenic, a depressive, and a manic. The
subject is instructed to select the two pictures he "likes most" and the two
he "dislikes most" in each set. He is not, of course, given any indication of
the psychiatric classification of the persons photographed. It is recommended
that the test be administered at least six, and preferably ten, times with at
least a one-day interval between administrations.

To score the Szondi Test, it is necessary first to find the number of
pictures liked and disliked in each of the eight categories. The test is based
on the assumption that each person can be described in terms of eight
"need-systems," which correspond to the categories of the photographs. It is
further assumed that the selection or rejection of photographs indicates the
relative tension existing in these need-systems. The final interpretive
process is quite complex, taking into account not only the relative number of
photographs of each kind chosen, but also the interplay of the need-systems
and the temporal changes in the subject's responses.
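
The first scoring step described above, counting the pictures liked and disliked in each of the eight categories across the six sets, amounts to a simple tally. The following sketch shows that tally only; the choices recorded for the single administration are invented, the data layout and function name are hypothetical, and nothing here represents the interpretive system itself, whose validity is discussed in the next paragraph.

```python
from collections import Counter

# The eight diagnostic categories named in the text.
CATEGORIES = [
    "homosexual", "sadistic murderer", "epileptic", "hysteric",
    "catatonic schizophrenic", "paranoid schizophrenic", "depressive", "manic",
]


def tally_choices(administration):
    """Count liked and disliked pictures per category for one administration.

    administration is a list of (liked, disliked) pairs, one per set of eight
    photographs; each element of the pair lists the two categories chosen.
    Returns a dict mapping category -> (n_liked, n_disliked).
    """
    liked, disliked = Counter(), Counter()
    for liked_pair, disliked_pair in administration:
        if len(liked_pair) != 2 or len(disliked_pair) != 2:
            raise ValueError("two liked and two disliked pictures per set")
        liked.update(liked_pair)
        disliked.update(disliked_pair)
    return {c: (liked[c], disliked[c]) for c in CATEGORIES}


# Hypothetical record of one administration (six sets of eight photographs).
administration = [
    (["manic", "hysteric"], ["epileptic", "depressive"]),
    (["manic", "depressive"], ["epileptic", "sadistic murderer"]),
    (["hysteric", "manic"], ["paranoid schizophrenic", "epileptic"]),
    (["homosexual", "manic"], ["depressive", "epileptic"]),
    (["catatonic schizophrenic", "hysteric"], ["sadistic murderer", "epileptic"]),
    (["manic", "hysteric"], ["depressive", "paranoid schizophrenic"]),
]

tally = tally_choices(administration)
for category, (n_liked, n_disliked) in tally.items():
    print(f"{category:<24} liked {n_liked}  disliked {n_disliked}")
```

Since the test is to be given six to ten times, such a tally would be repeated for each administration before any temporal comparison is attempted.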

The Szondi Test is probably one of the least promising of the currently 
popular projective techniques. Its theoretical rationale, both as originally
presented by Szondi and as revamped by Deri, appears particularly weak
and farfetched. Attempts at empirical validation of its various assumptions
have so far yielded overwhelmingly negative results (25). 

Tomkins-Horn Picture Arrangement Test. The development of the PAT
(99) was influenced both by the general approach of the TAT and by the
authors' observation that personality factors seemed to affect performance
on the Picture Arrangement test of the Wechsler-Bellevue (cf. Ch. 12).
Each of the 25 items of the PAT consists of three sketches presented in a
round-robin arrangement so as to minimize positional set. The subject's task
is to indicate the order of the three pictures "which makes the best sense"
and to write a sentence for each of the three pictures to tell the story. All
items deal with interpersonal relations. Since the test was originally designed
for use in the selection and guidance of industrial personnel, more than half
of its items portray a work situation.

Administration of the PAT is simple and suitable for testing large groups.
Quantitative scoring is completely objective but time consuming. However,
the test can be hand scored by a clerk or machine scored. Any subject's scores
are based on those responses that are rare (frequency less than 5 per cent)
among persons of his age, IQ, and educational level. It is the uncommon
responses that a subject gives that are believed to be diagnostic for him. The
authors have worked out 655 scoring patterns, representing significant
response combinations. Once the quantitative score pattern has been identified
for the individual, it is then interpreted by the clinical psychologist, who
may also refer to the subject's written material at this stage.

The attempt to develop an empirical system of pattern scoring is one of the
contributions of this test. Another noteworthy feature is to be found in its
unusually good norms. Through the resources of the Gallup Poll organization,
the PAT was administered to 1500 persons constituting a representative sample
of the United States population. In addition, comparative data were obtained
on over seven hundred psychiatric patients in 84 clinics and mental hospitals.
A vocabulary test given at the same time made possible the publication of
subgroup norms with reference to intelligence as well as age, education, and
other demographic variables.

On the negative side, no data are provided on interitem consistency.
Temporal stability over a three-week interval proved to be low. And validity
data are lacking. On the whole, the PAT has some ingenious features; it
provides far better norms than most projective techniques; and it should be
useful in research or as an adjunct to the clinical interview. But like many
personality tests, it is not ready for general psychometric application.
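
The rule described above for the PAT, that a subject's scores rest on responses given by fewer than 5 per cent of the relevant norm group, can be sketched as a simple lookup. The norm frequencies, the three-item excerpt, and the letter coding of picture orders are all hypothetical; treating an order absent from the norm table as rare is likewise an assumption. The 655 scoring patterns worked out by the authors are not modeled here.

```python
RARE_THRESHOLD = 0.05  # "frequency less than 5 per cent," as stated in the text


def rare_responses(subject_orders, norm_frequencies, threshold=RARE_THRESHOLD):
    """Return the items on which the subject's picture order is rare.

    subject_orders maps item number -> chosen order of the three sketches,
    coded here as a string such as "ABC".
    norm_frequencies maps item number -> {order: proportion of the norm
    subgroup giving that order}. An order missing from the norm table is
    treated as never observed, hence rare (an assumption of this sketch).
    """
    rare_items = []
    for item, order in subject_orders.items():
        frequency = norm_frequencies[item].get(order, 0.0)
        if frequency < threshold:
            rare_items.append(item)
    return rare_items


# Hypothetical norms for three items, for one age/IQ/education subgroup.
norm_frequencies = {
    1: {"ABC": 0.60, "ACB": 0.20, "BAC": 0.10, "BCA": 0.05, "CAB": 0.03, "CBA": 0.02},
    2: {"ABC": 0.70, "ACB": 0.10, "BAC": 0.10, "BCA": 0.04, "CAB": 0.03, "CBA": 0.03},
    3: {"ABC": 0.45, "ACB": 0.30, "BAC": 0.15, "BCA": 0.05, "CAB": 0.04, "CBA": 0.01},
}

# Hypothetical subject: a rare order on item 1, common or borderline elsewhere.
subject_orders = {1: "CAB", 2: "ABC", 3: "BCA"}

rare = rare_responses(subject_orders, norm_frequencies)
print("items with rare responses:", rare)
```

Because rarity is defined relative to the subject's own subgroup, a separate frequency table would be needed for each combination of age, IQ, and educational level, which is where norms of the kind described in the text become essential.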


EXPRESSIVE METHODS 


As pointed out earlier in the chapter, expressive methods differ from
construction techniques such as the TAT and MAPS in that the former consider
the subject's method or style as well as the characteristics of the finished
product. Another distinguishing feature of projective techniques in the
present category is that they serve as therapeutic as well as diagnostic
devices (64). Through the opportunities for self-expression that these
techniques afford, it is believed that the individual not only reveals his
difficulties but also relieves them. The principal projective techniques in
this category include drawing and painting, play activities, and psychodrama.

Drawing and Painting. The use of drawing and painting for diagnostic as
well as therapeutic purposes has a long and voluminous history (cf. 5; 6; 7;
8; 19, Ch. 17-19; 52). Few standardized instruments have emerged from
this vast area of clinical practice and research. It will be recalled that
drawings have also been utilized in non-projective testing, as illustrated by
the Goodenough Draw-a-Man Test described in Chapter 10. In the
restandardization of that test by Harris, however, projective uses of the
drawings as an approach to personality were also explored (52).

Although almost every art medium, technique, and type of subject matter
has been investigated in the search for significant diagnostic clues, special
attention has centered upon drawings of the human figure. A well-known
example is provided by the Machover Draw-a-Person Test (71). In this test,
the subject is provided with a letter-size sheet of paper and a medium-soft
pencil, and is told simply to "draw a person." For young children, the
instructions may be altered to "draw somebody" or "draw a boy or a girl." Upon
completion of the first drawing, the subject is asked to draw a person of the
opposite sex from that of the first figure. While the subject draws, the
examiner notes his comments, the sequence in which different parts are drawn,
and other procedural details. The drawing may be followed by an inquiry, in
which the subject is asked to make up a story about each person drawn, "as if
he were a character in a play or novel." A series of questions is employed
during the inquiry to elicit specific information about age, schooling,
occupation, family, and other facts associated with the characters portrayed.

Scoring of the Draw-a-Person Test is essentially qualitative, involving the
preparation of a composite personality description from an analysis of many
features of the drawings. Among the factors considered in this connection are
the absolute and relative size of the male and female figures, their position
on the page, quality of lines, sequence of parts drawn, stance, front or
profile view, position of arms, depiction of clothing, and background and
grounding effects. Special interpretations are given for the omission of
different bodily parts, disproportions, shading, amount and distribution of
details, erasures, symmetry, and other stylistic features. There is also
detailed discussion of the significance of each major body part, such as head,
individual facial features, hair, neck, shoulders, breast, trunk, hips, and
extremities.

The interpretive guide for the Draw-a-Person Test abounds in sweeping
generalizations, such as "Disproportionately large heads will often be given
by individuals suffering from organic brain disease," or "The sex given the
proportionately larger head is the sex that is accorded more intellectual and
social authority." But no evidence is provided in support of these statements.
Reference is made to a file of "thousands of drawings" examined in clinical
contexts, and a few selected cases are cited for illustrative purposes. No
systematic presentation of data, however, accompanies the published report of
the test.

Validation studies by other investigators have yielded conflicting results.
Attempts to develop semi-objective scoring procedures which utilize rating
scales or checklists have met with little success (cf., e.g., 9). The test may
be more successful with children and other relatively naive subjects than with
sophisticated adult groups. Although it appears to differentiate between
seriously disturbed persons and normals, its discriminative value within
relatively normal groups is questionable. Some of the pertinent studies are
inconclusive owing to their failure to cross-validate.

Another drawing test that has aroused considerable interest, as witnessed by
the amount of relevant research, is the House-Tree-Person Projective
Technique (H-T-P) devised by Buck (28). In this test, the subject is told to
draw as good a picture of a house as he can; the same instructions are
repeated in turn with "tree" and "person." Meanwhile, the examiner takes
copious notes on time, sequence of parts drawn, spontaneous comments by the
subject, and expressions of emotion. The completion of the drawings is
followed by an oral inquiry including a long set of standardized questions.
The drawings are analyzed both quantitatively and qualitatively, chiefly on
the basis of their formal or stylistic characteristics.

In discussing the rationale underlying the choice of objects to be drawn,
Buck maintains that "house" should arouse associations concerning the
subject's home and those living with him; "tree" should evoke associations
pertaining to his life role and his ability to derive satisfaction from his
environment in general; and "person" should call up associations dealing with
interpersonal relations. Some clinicians may find helpful leads in such
drawings when they are considered jointly with other information about the
individual case. But the elaborate and lengthy administrative and scoring
procedures described by Buck appear unwarranted in the light of the highly
inadequate nature of the supporting data.

In addition to drawing, several other art media have been tried in both
diagnostic and therapeutic situations. Outstanding among them is finger
painting (cf. 10, Ch. 14). Finger paints are spread directly with the hand
upon a large sheet of glossy paper. They thus combine some of the aspects
of painting and modeling. Simplicity of use is an important asset of this
technique. Subjects with no previous art training or experience can readily
produce many pleasing and satisfying designs. The fact that such paints can
be easily washed off with clear water is a further practical advantage. For
children, they also provide some of the attraction of playing with mud,
smearing, or simply “making a mess.” These features of finger paints are
considered particularly important by some clinicians.

In the interpretation of finger paintings, as in the projective use of
paintings and drawings, some attempts have been made to attach specific
diagnostic significance to the objects portrayed and to the formal
characteristics of the products. Systems of more or less uniform symbols, often
of a psychoanalytic nature, have been proposed for such purposes. Pending the
empirical demonstration of the validity of specific signs, however, such
interpretations are of dubious value. In their present stage of development,
drawing and painting techniques can probably serve best by providing leads
for the clinician to follow up through interviewing or other procedures.

Play Techniques. Play and dramatic objects, such as puppets, dolls, toys,
and miniatures, have also been utilized in projective testing (cf. 19).
Originating in play therapy with children, these materials have subsequently
been adapted for the diagnostic testing of both adults and children.
The objects are usually selected because of their expected associative value.
Among the articles most frequently employed for these purposes, for
example, are dolls representing adults and children of both sexes and various
age levels, furniture, bathroom and kitchen fixtures, and other household
furnishings. Play with such articles is expected to reveal the child’s attitudes
toward his own family, as well as sibling rivalries, fears, aggressions,
conflicts, and the like. The examiner notes what items the child chooses, what
he does with them, his verbalizations, emotional expressions, and other
overt behavior.

With children, these techniques often take the form of free play with the
collection of toys that the examiner simply makes available. With adults, the
materials are presented with general instructions to carry out some task of a
highly unstructured nature. These instructions may, of course, also be
employed with children. Frequently the task has certain dramatic features, as
in the arrangement of figures on a miniature stage set. Several investigators
have used play techniques in studying prejudice and other intergroup
attitudes (35, p. 17).

One attempt to standardize projective “toy tests” is represented by the
World Test. First developed in England by Lowenfeld (68), this test has
been adapted, revised, and restandardized by Buhler and her associates (29,
30), Bolgar and Fischer (24), and others (cf. 19, Ch. 23). The materials
consist of a large number of miniature pieces (from 150 to 300 in different
forms), including houses, people, animals, bridges, trees, cars, fences, and
other common objects that might be found outdoors. The subject is told to
construct whatever he would like, using a large table top, the floor, or a
sandbox as a base. In the various adaptations of the World Test, the
responses have been evaluated in a number of ways. Some interpretive systems
rely chiefly on formal properties of procedure and product, such as sequence
of pieces chosen, number and variety of objects included, rigidity of
organization, and the like. Others place more emphasis on content and symbolism.
Accompanying verbalizations and other expressive reactions are also
considered.

Psychodrama. In this technique (74; 19, Ch. 24), the subject himself enacts
various roles on a stage, while a “director” (examiner-therapist) guides the
general procedure. The active participation of an audience is also considered
an integral aspect of this process. The acting is done in a number of
appropriately chosen situations. For example, the subject is instructed that he
is on the stage with another person and is to invent a relationship with this
character, identifying the character and his activity. The subsequent action is
left to the subject. This situation is designed to reveal what a relationship
means to the subject, as well as his manner of communicating with another
person. In some situations, additional live actors, known as “auxiliary egos,”
play parts. Examples of some of the other situations are: acting upon real or
imaginary objects; silent action that utilizes gross bodily movements; rapid
shifting from one situation to another; and reversal of roles, in which the
subject and another person trade places in the course of a scene. Into the
various situations, the director may introduce a wide range of themes, such as
love, death, family, economic problems, status, and security.

From a practical point of view, the application of psychodrama is lim- 
ited by its demands in time, personnel, and physical facilities, although sim- 
plified versions are often employed. Lack of adequate standardization, with 
regard to administration, recording, and especially interpretation, is a serious 
drawback. A more fundamental limitation stems from the absence of validat- 
ing data. It has been asserted that this technique needs no validation, since it 
utilizes samples of actual behavior. To be sure, all tests do so. But the
question of validity remains nevertheless. To what extent does the subject's
behavior on the psychodrama stage correlate with (or to what extent does it
serve as a dependable indicator of) his behavior in other situations?


CRITICAL EVALUATION 


It is evident that projective techniques differ widely among themselves.
Some appear more promising than others because of more favorable empirical
findings, sounder theoretical orientation, or both. Regarding some techniques,
such as the Rorschach, voluminous data have been gathered, albeit
their interpretation is often uncertain. About others little is known, either
because of their recent origin, or because objective verification is hindered by
the intrinsic nature of the instruments or by the attitudes of their exponents.

To evaluate each instrument individually and to attempt to summarize the
extensive pertinent literature would require a separate volume. Within this
chapter, critical comments have been interjected only in the cases of
instruments that presented unique features, whether of a favorable or
unfavorable nature. There are certain points, however, that apply to a greater
or lesser extent to the bulk of projective techniques. These points can be
conveniently considered in summary form.


Rapport and Applicability. Most projective techniques represent an effective
means for “breaking the ice” during the initial contacts between subject and
examiner. The task is usually intrinsically interesting and often entertaining
to the subject. It tends to divert the subject’s attention away from himself
and thus reduces embarrassment and defensiveness. And it offers little or no
threat to his prestige, since any response he gives is “right.”

It should be noted parenthetically, however, that projective techniques
may have poor face validity for normal and superior adults, a fact
that diminishes their acceptability for industrial selection, military screening
or classification, and similar purposes. Wechsler’s objections to the adult use
of child-oriented intelligence tests (cf. Ch. 12) apply with even
greater cogency to certain projective techniques. It is not difficult to predict
the reactions of an Air Force pilot when confronted with four anthropomorphic
portraits of dogs labeled “Blacky,” “Papa,” “Mama,” and “Tippy,”
respectively!

On the other hand, projective techniques may be especially useful with
young children, illiterates, and persons with language handicaps or speech
defects. Non-verbal media would be readily applicable to all of these groups.
Oral responses to pictorial and other non-language stimuli could be
secured from the first two. With all these verbally limited groups, projective
techniques may help the subject to communicate with the examiner. These
techniques may also aid the subject in clarifying for himself some of his own
behavior that he had not previously verbalized.

Faking. In general, projective instruments are less susceptible to faking than
are self-report inventories. The purpose of projective techniques is usually
disguised. Even if an individual has some psychological sophistication and is
familiar with the general nature of a particular instrument, such as the
Rorschach or TAT, it is still unlikely that he can predict the intricate ways
in which his responses will be scored and interpreted. Moreover, the subject
soon becomes absorbed in the task and hence is less likely to resort to
the customary disguises and restraints of interpersonal communication.

On the other hand, it cannot be assumed that projective tests are
completely immune to faking. Several experiments with the Rorschach, TAT,
and other projective instruments have shown that significant differences do
occur when subjects are instructed to alter their responses so as to create
favorable or unfavorable impressions, or when they are given statements
suggesting that certain types of responses are more desirable (72). In a
particularly well-controlled study, Davids and Pildner (39) administered a
battery of self-report and projective tests to two groups of college students,
one of which took the tests as genuine job applicants, the other as
participants in a research project. Under these conditions, the job applicants
obtained significantly better-adjusted scores than the research subjects on the
self-report but not on the projective tests. Certain types of projective test
items, however, have been found to be susceptible to faking. Thus sentence
completion stems expressed in the first person yielded significantly more
favorable responses than those expressed in the third person.

Standardization. It is obvious that most projective techniques are
inadequately standardized with respect to both administration and scoring. Yet
there is evidence that even subtle differences in the phrasing of verbal
instructions and in examiner-subject relationships can appreciably alter
performance on these tests (15, 72, 88, 90). Even when employing identical
instructions, some examiners may be more encouraging or reassuring, others
more threatening, owing to their general manner and appearance. Such
differences may affect response productivity, defensiveness, stereotypy,
imaginativeness, and other basic performance characteristics. In the light of
these findings, problems of administration and testing conditions assume even
greater importance than in other psychological tests.

Equally serious is the lack of objectivity in scoring. It will be recalled that
even when objective scoring systems have been developed, the final steps in
the evaluation and integration of the raw data depend upon the skill and
clinical experience of the examiner. Such a situation has several
implications. In the first place, it reduces the number of examiners who are
properly qualified to employ the technique and thus limits the range of its
effective application. It also means that the results obtained by different
examiners may not be comparable, a fact that complicates research with the
instrument. But perhaps the most disturbing implication is that the
interpreting of scores is often as projective for the examiner as the test
stimuli are for the subject. In other words, the final interpretation of
projective test responses may reveal more about the theoretical orientation,
favorite hypotheses, and personality idiosyncrasies of the examiner than it
does about the subject's personality dynamics.

Norms. Another conspicuous deficiency common to most projective
instruments pertains to normative data. Such data may be completely lacking,
grossly inadequate, or based upon vaguely described populations. In the
absence of adequate objective norms, the clinician falls back upon his “general
clinical experience” to interpret projective test performance. But such a
frame of reference is subject to all the distortions of memory that are
themselves reflections of theoretical bias, preconceptions, and other
idiosyncrasies of the clinician. Moreover, any one clinician's contacts may
have been limited largely to persons who are atypical in education,
socioeconomic level, sex ratio, age distribution, or other relevant
characteristics. In at least one respect, the clinician's experience is almost
certain to produce a misleading picture, since he deals predominantly with
maladjusted or pathological cases. He thus lacks sufficient firsthand
familiarity with the characteristic reactions of normal people. The Rorschach
norms gathered by Ames and her associates on children, adolescents, and persons
over 70 represent a recent effort to correct some of the more obvious lacks.
The representative nationwide sample examined in the standardization of the
Tomkins-Horn Picture Arrangement Test is an outstanding exception to the
traditional practices followed in projective test development.

Interpretation of projective test performance often involves subgroup
norms, of either a subjective or an objective nature. Thus the clinician may
have a general subjective picture of what constitutes a “typical”
schizophrenic or psychoneurotic performance on a particular test. Or the
published data may provide qualitative or quantitative norms that delineate the
characteristic performance of different diagnostic groups. In either case, the
subgroup norms may lead to faulty interpretations unless the subgroups were
equated in other respects. For example, if the schizophrenics and normals on
whom the norms were derived differed also in mean educational level, the
observed disparities between schizophrenic and normal performance may
have resulted from educational inequality rather than from schizophrenia.
Similar systematic or constant errors may operate in the comparison of various
psychiatric syndromes. For example, schizophrenics as a group tend to
be younger than manic-depressives; anxiety neurotics are likely to come
from higher educational and socioeconomic levels than hysterics.

Reliability. In view of the relatively unstandardized scoring procedures and
the inadequacies of normative data, scorer reliability becomes an important
consideration in projective testing. For projective techniques, a proper
measure of scorer reliability should include not only the more objective
preliminary scoring, but also the final integrative and interpretive stages. It
is not enough, for example, to demonstrate that examiners who have mastered
the same system of Rorschach scoring agree closely in their tallying of such
characteristics as whole, unusual detail, or color responses. On a projective
test such as the Rorschach, these raw quantitative measures cannot be
interpreted directly from a table of norms, as in the usual type of
psychological test. Interpretive scorer reliability is concerned with the
extent to which different examiners attribute the same personality
characteristics to the subject on the basis of their interpretations of the
identical record.
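
The notion of interpretive scorer reliability can be made concrete with a
small computational sketch. The Python fragment below is an illustration only,
not a procedure taken from the studies cited. It assumes that two examiners
have each judged the same ten records for the presence of a single trait, and
it reports the proportion of records on which they agree, together with Cohen's
kappa, an index not discussed in this text that corrects the proportion for
the agreement expected by chance. The judgments, the labels, and the function
names are all hypothetical.

```python
from collections import Counter


def percent_agreement(judge_a, judge_b):
    """Proportion of records on which two examiners give the same judgment."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both examiners must judge the same non-empty set of records")
    return sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)


def cohens_kappa(judge_a, judge_b):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(judge_a)
    observed = percent_agreement(judge_a, judge_b)
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    categories = set(counts_a) | set(counts_b)
    # Chance agreement: product of the two examiners' marginal proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1:
        return 1.0  # both examiners used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical interpretations of the same ten records by two examiners.
EXAMINER_1 = ["hostile"] * 6 + ["not hostile"] * 4
EXAMINER_2 = ["hostile"] * 4 + ["not hostile"] * 5 + ["hostile"]

if __name__ == "__main__":
    print(percent_agreement(EXAMINER_1, EXAMINER_2))
    print(cohens_kappa(EXAMINER_1, EXAMINER_2))
```

With these invented judgments the examiners agree on 7 of the 10 records, but
an agreement of 5 in 10 would be expected by chance from their marginal
frequencies alone, so that kappa is only about .40.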


Few adequate studies have been conducted on the scorer reliability of
projective tests. Some investigations have revealed marked divergencies in the
interpretations given by reasonably well-qualified test users. A fundamental
ambiguity in such results stems from the unknown contribution of the
interpreter's skill. Neither high nor low scorer reliability can be directly
generalized to other scorers differing appreciably from those utilized in the
particular investigation.

Attempts to measure other types of test reliability have fared equally
poorly in the field of projective testing. Coefficients of internal
consistency, when computed, have usually been low. In such tests as the
Rorschach and TAT, it has been argued that different cards are not comparable
and hence should not be used in finding split-half reliabilities. One
solution, of course, would be to construct a parallel form that is comparable.
It should also be noted that in so far as responses to different Rorschach or
TAT cards are combined in arriving at a total estimate of any personality
characteristic, intercard agreement is assumed. Yet empirical results have
rarely supported this assumption. A careful investigation of TAT reliability,
for example, yielded internal consistency correlations no higher than .34 for
10 of the major themes, such as achievement, aggression, etc. (34).
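
A split-half coefficient of the kind referred to above could be computed along
the following lines. The sketch is illustrative and rests on stated
assumptions: each subject is taken to have a numerical score on each card for
a single theme, the odd-numbered and even-numbered cards are treated as the two
halves, and the half-test correlation is stepped up by the Spearman-Brown
formula. The scores and function names are hypothetical. It should be noted
that the procedure presupposes comparable halves, which is the very assumption
questioned for the Rorschach and TAT cards.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def spearman_brown(r_half):
    """Estimated reliability of the full test from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(card_scores):
    """card_scores holds one list per subject, with one theme score per card."""
    odd_totals = [sum(person[0::2]) for person in card_scores]   # cards 1, 3, ...
    even_totals = [sum(person[1::2]) for person in card_scores]  # cards 2, 4, ...
    r_half = pearson_r(odd_totals, even_totals)
    return r_half, spearman_brown(r_half)


# Hypothetical theme scores of five subjects on four cards.
HYPOTHETICAL_SCORES = [
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [2, 1, 1, 2],
    [2, 3, 2, 2],
    [3, 2, 3, 3],
]

if __name__ == "__main__":
    print(split_half_reliability(HYPOTHETICAL_SCORES))
```

For these invented scores the correlation between halves is about .89 and the
stepped-up estimate about .94. Values of this size are far above the intercard
correlations actually reported for the TAT themes.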

Retest reliability also presents special problems. With long intervals,
genuine personality changes may occur which the test should detect. With
short intervals, a retest may show no more than recall of original responses.
When investigators instructed their subjects to write different TAT stories
on a retest, in order to determine whether the same themes would recur,
most of the scored variables yielded insignificant retest correlations (65). It
is also relevant to note that many scores derived from projective techniques
are based upon very inadequate response samples. In the case of the
Rorschach, for instance, the number of responses within a given individual's
protocol that fall into such categories as animal movement, human movement,
shading, color, unusual detail, and the like may be so few as to yield
extremely unreliable indices. Large chance variations are to be expected
under such circumstances. Ratios and percentages computed with such
unreliable measures are even more unstable than the individual measures
themselves (cf. 36, pp. 411-412).
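
The instability of ratios based on small counts can be illustrated by a brief
simulation. The sketch below is hypothetical throughout: it assumes protocols
of 30 responses, two scoring categories that each attract a response with
probability .20 independently of one another, and chance variation only. For
each simulated protocol it records the two counts and their ratio, and then
compares their relative variability (standard deviation divided by mean).
Protocols in which the denominator count is zero are set aside, since the
ratio is then undefined.

```python
import random
import statistics


def coefficient_of_variation(values):
    """Standard deviation expressed as a fraction of the mean."""
    return statistics.pstdev(values) / statistics.mean(values)


def simulate_ratio_instability(n_protocols=5000, n_responses=30,
                               p_category=0.2, seed=0):
    """Relative variability of two small counts and of their ratio.

    All parameters are illustrative assumptions, not empirical Rorschach values.
    """
    rng = random.Random(seed)

    def category_count():
        # Number of responses, out of n_responses, falling in one category.
        return sum(rng.random() < p_category for _ in range(n_responses))

    first_counts, second_counts, ratios = [], [], []
    for _ in range(n_protocols):
        first, second = category_count(), category_count()
        if second == 0:
            continue  # ratio undefined for this protocol
        first_counts.append(first)
        second_counts.append(second)
        ratios.append(first / second)

    return (coefficient_of_variation(first_counts),
            coefficient_of_variation(second_counts),
            coefficient_of_variation(ratios))


if __name__ == "__main__":
    cv_first, cv_second, cv_ratio = simulate_ratio_instability()
    print(cv_first, cv_second, cv_ratio)
```

Under these assumptions each count averages about six responses, and the ratio
of the two counts varies from protocol to protocol considerably more, in
relative terms, than either count does.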

Validity. For any test, the most fundamental question is that of validity.
Most empirical validation studies of projective tests have been concerned
with concurrent validity. Of these, many have compared the performance of
contrasted groups. Other investigations of concurrent validity have used
procedures in which personality descriptions derived from the test are
compared with descriptions or data about the same subjects taken from case
histories, psychiatric interviews, or long-range behavioral records. A few
studies have investigated predictive validity against such criteria as success
in specialized types of training or response to psychotherapy. There has been
an increasing trend to investigate the construct validity of projective
instruments by testing specific hypotheses that underlie the use and
interpretation of each test. This approach is illustrated by studies of the
effects of deprivation, drugs, anxiety, and frustration on test performance, as
well as by research on examiner and situational variables.

All too often the advocates of a projective technique have based their
claims on content validity, which in these cases means essentially that the
technique appears to fit some particular personality theory. Such assertions
can do no more than provide hypotheses for empirical verification. Mention
should also be made of what has been variously described as “clinical
validity,” “subjective validity,” and “faith validity.” This is simply a feeling
of satisfaction, unaccompanied by any demonstrable or communicable proof,
reported by some clinicians who use a given technique. It is outside the
realm of science.

The large majority of published validation studies on projective
techniques are inconclusive because of procedural deficiencies in either
experimental controls, statistical analysis, or both (cf. 36, 37, 41). This is
especially true of studies concerned with the Rorschach. Some methodological
deficiencies may have the effect of producing spurious evidence of test
validity where none exists. An example is the contamination of either
criterion or test data. Thus the criterion judges may have had some knowledge of
the subjects’ test performance. Similarly, the examiner may have obtained
cues about the subject’s characteristics from conversation with the subject in
the course of test administration, or from case history material and other
non-test sources. The customary control for the latter type of contamination
in validation studies is to utilize “blind analysis,” in which the test record is
interpreted by a scorer who has had no contact with the subject and who
has no information about him other than that contained in the test protocol.

Another source of spurious validity data is failure to cross-validate
(cf. Ch. 7). Because of the large number of potential diagnostic signs
or scorable elements that can be derived from most projective tests, it is very
easy by chance alone to find a set of such signs that differentiate
significantly between criterion groups. The validity of such a scoring key,
however, will collapse to zero when applied to new samples.
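
The way in which chance alone can yield an apparently valid scoring key may
be shown by a simulation. The sketch below is a hypothetical illustration and
does not reproduce any published study. It assumes 400 scorable signs, each of
which appears in a record with probability .50 regardless of the group to
which the subject belongs, so that no sign has any true validity. A key is
built from the signs that happen to differ by at least 25 percentage points
between two derivation groups of 30 subjects each, and the key is then applied
both to the derivation groups and to a fresh sample. All names and numerical
values are assumptions chosen for the illustration.

```python
import random


def simulate_sign_key(n_signs=400, n_per_group=30, n_new=1000,
                      min_diff=0.25, seed=0):
    """Build a scoring key from pure noise and cross-validate it.

    Returns (number of signs in the key, hit rate in the derivation sample,
    hit rate in a new sample). All parameters are illustrative assumptions.
    """
    rng = random.Random(seed)

    def draw_group(n):
        # Every sign is a coin flip, unrelated to group membership.
        return [[rng.random() < 0.5 for _ in range(n_signs)] for _ in range(n)]

    def sign_rate(group, j):
        return sum(person[j] for person in group) / len(group)

    group_a, group_b = draw_group(n_per_group), draw_group(n_per_group)

    # Keep only the signs that "differentiate" the two derivation groups.
    key = {}
    for j in range(n_signs):
        diff = sign_rate(group_a, j) - sign_rate(group_b, j)
        if abs(diff) >= min_diff:
            key[j] = 1 if diff > 0 else -1

    def score(person):
        return sum(weight for j, weight in key.items() if person[j])

    def mean_score(group):
        return sum(score(p) for p in group) / len(group)

    # Cutting score midway between the derivation group means.
    cutoff = (mean_score(group_a) + mean_score(group_b)) / 2

    def hit_rate(a, b):
        hits = sum(score(p) > cutoff for p in a) + sum(score(p) <= cutoff for p in b)
        return hits / (len(a) + len(b))

    new_a, new_b = draw_group(n_new), draw_group(n_new)
    return len(key), hit_rate(group_a, group_b), hit_rate(new_a, new_b)


if __name__ == "__main__":
    print(simulate_sign_key())
```

Under these assumptions the key classifies the derivation subjects well above
the chance level of 50 per cent, although every sign is noise, while its hit
rate in the new sample falls back to approximately 50 per cent.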

Inadequacies of experimental design may also have the effect of
underestimating the validity of a diagnostic instrument. It is widely
recognized, for example, that traditional psychiatric categories, such as
schizophrenia, manic-depressive psychosis, and hysteria, represent crude and
unrealistic classifications of the personality disorders actually manifested by
patients. Hence if such diagnostic categories are used as the sole criterion
for checking the validity of a personality test, negative results are
inconclusive. The importance of complex patterns of relationship among
personality variables was illustrated by the previously cited studies of
aggression indicators in TAT stories.

Few research projects have been so designed as to avoid all the major pit- 
falls of projective test validation. The advanced student is urged to examine 
for himself the reports of such studies as those of Henry and Farley (54),
Little and Shneidman (66), and Silverman (92), which set a high standard 
in experimental design. While differing in type of subject examined and in 
specific problems they set out to investigate, these studies point to a common 
conclusion: when experienced clinicians are given an opportunity to examine 
and interpret in their own way subjects’ protocols from such projective tests 
as the Rorschach, TAT, and MAPS, their evaluations of the subjects’ person- 
alities tend to match independent case history evaluations significantly better 
than chance. In so far as can be ascertained, however, the obtained relations 
are low. Moreover, the relationship appears to be a function of the particular 
clinician and subject, a number of individual matches being no better than 
chance. There is also little agreement among evaluations based on different 
projective techniques, or among different clinicians using the same technique. 

Current Trends. When first introduced, many projective techniques were 
surrounded by an atmosphere of cultism that tended to insulate them from 
the main stream of psychological research. The isolationism characterizing 
the exponents of such techniques retarded both the effective development 
and evaluation of the techniques and their acceptance by the psychological 
profession. One of the chief developments of personality testing during the
1950's was the attempt on the part of many psychologists to bring projective
techniques into the main body of psychological science through a clarification
of underlying theoretical rationale as well as carefully designed
experimentation.

In the rapidly accumulating research findings on projective techniques, we
can recognize certain trends pointing toward future developments. First, well- 
designed validation studies utilizing blind analyses and adequate experimen- 
tal controls have demonstrated that even under optimal conditions such tech- 
niques have inadequate validity to justify their use as psychometric tools in 
making individual decisions. The utilization of projective techniques as tests 
thus appears to be on the decline. 

Nevertheless, certain projective techniques may serve a useful function as
interviewing aids in the hands of skilled clinicians. The realization that this 
is the proper role of projective techniques represents a second major trend. 
In this connection, qualitative content analysis is proving more fruitful than 
formal scoring categories. Furthermore, since the clinician is an intrinsic part 
of the interviewing situation, research on the effects of interpersonal
examiner and subject variables on projective performance is relevant to this
approach (51, 72, 100).

Borrowing a concept from information theory, Cronbach and Gleser (38,
p. 128) characterize interviewing and projective techniques as “wide-band”
procedures. Bandwidth, or breadth of coverage, is achieved at the cost of
lowered fidelity or dependability of information. Objective psychometric
tests characteristically yield a narrow band of information at a high level of
dependability. In contrast, projective and interviewing techniques provide a
much wider range of information of lower dependability. Moreover, the kinds
of data furnished by any one projective technique may vary from individual
to individual (38, p. 129). One person's TAT responses, for example, may
tell us a good deal about his aggression and little or nothing about his
creativity or achievement drive; another person's record may permit a
thorough assessment of the degree of creativity or rigidity of his behavior and
of the strength of his achievement drive, while revealing little about his
aggression. Such a lack of uniformity in the kinds of information provided in
individual cases helps to explain the low validities found when projective
test responses are analyzed for any single trait across a group of persons.

It is interesting to note that a similar unevenness characterizes clinicians’ 
interpretations of individual records. Thus in their study of the validity of 
the TAT, Henry and Farley (54, p. 22) conclude: 


There is no single correct way of employing the TAT interpretation. There was
little item agreement between judges, but each judge made enough “correct”
decisions to yield a highly significant agreement figure. Judges may arrive at
essentially the same interpretive implications of the test report, by quite
different routes, or judges may differ individually in their ability to utilize
TAT predictions in different areas . . . or for different subjects.


The nature of “clinical judgment” through which projective and interviewing
data may be utilized in reaching decisions about individual cases is
receiving increasing attention from psychologists (56, 69, 73). In this process
the very constructs or categories in terms of which the data are organized are
built up inductively through an examination of the particular combination
of data available in the individual case. The special function of the clinician
is to make predictions from unique or rare combinations of events about
which it is impracticable to prepare any statistical table or equation. By
creating new constructs to fit the individual case, the clinician can predict
from combinations of events that he has never encountered before in any other
case. In making his predictions, he can also take into account the varied
significance of similar events for different individuals. Such clinical
predictions are helpful, provided they are not accepted as final but are
constantly tested against information elicited through subsequent inquiry,
test responses, reaction to therapy, or other behavior on the part of the
subject. It follows from the nature of interviewing and projective techniques
that decisions should not be based on any single datum or score obtained from
such sources. These techniques serve best in sequential decisions, by
suggesting leads for further exploration or hypotheses about the individual for
subsequent verification.

A third trend can be seen in the increasing use of projective techniques as 
raw materials for the development of objective tests. In this capacity, pro- 
jective techniques provide hypotheses, such as possible relationships between 
specific aspects of perceptual responses and personality traits, around which 
structured tests can be devised. Such tests are objectively scorable, reducing 
to a minimum the role of the examiner as data gatherer or as data interpreter.
They are typically narrow-band procedures, seeking to provide dependable
information about a small segment of personality. To these tests, projective 
techniques are contributing not only a variety of fresh hypotheses but also 
testing procedures that minimize faking and facade effects. These objective 
personality tests are among the newer techniques for personality assessment 
to be surveyed in the next chapter.


Other Techniques for Personality Assessment


The paper-and-pencil inventories and the projective techniques surveyed 
in the preceding chapters undoubtedly represent the best-known and most 
widely used types of instruments for the appraisal of personality. Neverthe- 
less, there still remains a rich supply of other devices that are being explored 
for this purpose. Out of this diversity of approaches may come techniques 
that will eventually revitalize the measurement of personality and stimulate 
progress in new directions. 

The techniques to be considered in this chapter are extremely heteroge- 
neous in nature. Moreover, they do not fall into clear-cut or generally recog- 
nized categories, although several attempts have been made to work out 
schemas of classification and appropriate terminology (cf. 18, 19, 72, 79).
For the purposes of the present discussion, these techniques have been 
grouped into the following four types, each to be treated in a separate section: 
situational tests; tests utilizing perceptual, cognitive, or evaluative tasks; tests 
based on sociometry; and tests designed to assess self concepts and personal
constructs. Only a few examples of each type will be described for
illustrative purposes.

A few of the instruments to be considered may be regarded as special
adaptations of either self-report inventories or projective techniques. As in
most classifications, borderline specimens can be found that could be placed
in more than one category. In a later section of the chapter, we shall consider
some further procedures for the appraisal of personality. These are not,
properly speaking, psychological tests and hence will receive only brief
mention.




SITUATIONAL TESTS 


Although the term "situational test" was popularized during and follow- 
ing World War II, tests fitting this description had been developed prior to 
that time. Essentially, a situational test is one that places the subject in a situ- 
ation closely resembling or simulating a “real-life” criterion situation. Such 
tests thus show certain basic similarities to the "worksample" technique em- 
ployed in constructing trade tests (cf. Ch. 17). In the present tests, however, 
the criterion behavior that is sampled is more varied and complex. Moreover, 
interest is focused, not upon aptitude or achievement, but upon emotional, 
social, attitudinal, and other personality variables. 

Tests of the Character Education Inquiry. Among the earliest situational
tests, although they were not so labeled at the time, were those constructed
by Hartshorne, May, and their associates (43, 44, 45) for the Character
Education Inquiry (CEI). These tests were designed principally as research
instruments for use in an extensive project on the nature and development of
character in children. Nevertheless, the techniques can be adapted to other
testing purposes, and a number have been so utilized.

In general, the CEI techniques made use of familiar, natural situations
within the school child's daily routine. The tests were administered in the
form of regular classroom examinations, as part of the pupil's homework,
in the course of athletic contests, or as party games. Moreover, the children
were not aware that they were being tested, except in so far as ordinary
school examinations might be involved in the procedure. At the same time,
all of the Hartshorne-May tests represented carefully standardized
instruments which yielded objective, quantitative scores.

The CEI tests were designed to measure such behavior characteristics as
honesty, persistence, inhibition, and service (cooperation and charity). The
largest number of tests in the CEI series were concerned with cheating. One
group of tests developed for this purpose utilized a duplicating technique.
In this technique, common tests such as vocabulary, sentence completion, or
arithmetic reasoning were administered in the classroom. The test papers
were then collected and a duplicate of each child's responses was made. At
a subsequent session, the original, unmarked test papers were returned, and
each child scored his own paper from a key. Comparison with the duplicate
record revealed any changes the individual had made in scoring his paper.

The same types of test content were employed with the double testing
technique. In this case, two equivalent forms of the test were administered,
one under unsupervised conditions that provided an opportunity to copy from a
key, and the other under supervised conditions that eliminated the possibility
of cheating. The maximum drop in score that could be expected without
cheating was established empirically by administering the two forms of each
test under supervised conditions to a comparable group of subjects. A variant
of the same basic technique was employed in athletic contests in which the
individual reported his maximum achievement on tests of strength of grip,
lung capacity, chinning, and broad jump, following supervised "practice
trials."
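
The scoring logic of the duplicating and double testing techniques can be
sketched in a few lines of Python. This is only an illustrative sketch, not the
CEI scoring procedure itself: the function names, the sample answers, and the
allowance of 5 points for the honest drop between forms are hypothetical.

```python
def changed_answers(duplicate, returned):
    """Duplicating technique: list the item positions where the paper the
    child handed back differs from the duplicate made beforehand."""
    return [i for i, (before, after) in enumerate(zip(duplicate, returned))
            if before != after]


def excess_drop(unsupervised_score, supervised_score, max_honest_drop):
    """Double testing technique: amount by which the drop from the
    unsupervised to the supervised form exceeds the largest drop observed
    when both forms are given under supervision (0 if it does not)."""
    drop = unsupervised_score - supervised_score
    return max(0, drop - max_honest_drop)


# Hypothetical data for one child.
print(changed_answers(["a", "b", "c"], ["a", "d", "c"]))  # item 1 was altered
print(excess_drop(40, 28, max_honest_drop=5))             # drop of 12, 7 beyond the allowance
```

The empirically established allowance is the essential element: a drop in
score is treated as evidence of cheating only when it exceeds what supervised
subjects show between the two forms.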

A third technique for the detection of cheating was based upon improb- 
able achievement. In this case, a task was given under such conditions that 
achievement above a certain empirically established level indicated cheating. 
Among the tasks utilized for this purpose were weight discrimination, the 
solution of various mechanical puzzles, and paper-and-pencil tests of motor 
coordination. An example of the last-named type of test is provided by the 
Circles Puzzle. In this test, the subject was instructed to make a mark in each
of 10 small, irregularly arranged circles, while keeping his eyes shut. Control 
tests under conditions that precluded peeking indicated that a score of more 
than 13 correctly placed marks in a total of three trials was highly improb- 
able. By peeking, however, the child might obtain a higher score. 
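
The improbable-achievement rule for the Circles Puzzle amounts to a simple
cutoff on the total over three trials. The sketch below uses the cutoff of 13
given in the text; the function name and the sample scores are hypothetical.

```python
CIRCLES_PUZZLE_CUTOFF = 13  # from the text: more than 13 marks in three trials is highly improbable


def exceeds_improbable_level(trial_scores, cutoff=CIRCLES_PUZZLE_CUTOFF):
    """Return True when the total of correctly placed marks over the three
    trials is above the empirically established cutoff."""
    if len(trial_scores) != 3:
        raise ValueError("the Circles Puzzle score is based on three trials")
    return sum(trial_scores) > cutoff


print(exceeds_improbable_level([4, 4, 5]))  # total 13: not above the cutoff
print(exceeds_improbable_level([5, 5, 4]))  # total 14: above the cutoff
```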

Honesty was also measured by means of a few tests designed to detect
stealing and lying behavior. Tests of stealing can be illustrated by the Magic
Square, an arithmetical puzzle requiring coins for its solution. The object of
the puzzle was to arrange coins in such a way that the sums of all rows,
columns, and diagonals were the same. Upon completion of the task, subjects
were instructed to return the boxes containing the puzzles and coins.
Through code numbers on the puzzles, it was possible for the examiner to
identify any subjects who failed to return all the coins.

An example of a lying test from the Hartshorne-May series is provided
by a written questionnaire consisting of items similar to those
in the L-scale of the MMPI (cf. Ch. 18). The test consisted of
questions to be answered "yes" or "no" by the subject. Typical questions are
shown below (43, pp. 98-99):


Did you ever act greedily by taking more than your share of anything?

Are you always on time at school or for other appointments?

Do you always smile when things go wrong?

Did you ever say anything about your teacher that you would be unwilling
to say to her face?

On the basis of empirical investigations, it was decided that a subject who
answered 24 or more of these questions in the socially approved direction 
was probably lying. 
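
The lying score described above is a count of answers given in the socially
approved direction, compared with the cutoff of 24. A minimal sketch follows.
The 30-item key and the sample answers are hypothetical, since the full
questionnaire and its key are not reproduced here; only the cutoff comes from
the text.

```python
LIE_CUTOFF = 24  # cutoff reported in the text


def approved_count(answers, approved_key):
    """Count the answers that match the socially approved answer."""
    return sum(1 for given, approved in zip(answers, approved_key)
               if given == approved)


def probably_lying(answers, approved_key, cutoff=LIE_CUTOFF):
    """Apply the cutoff: this many approved answers or more is suspect."""
    return approved_count(answers, approved_key) >= cutoff


# Hypothetical key: "no" is the approved answer to "Did you ever act
# greedily ...?", "yes" to "Are you always on time ...?", and so on.
approved_key = ["no", "yes"] * 15
answers = approved_key[:25] + ["no", "yes", "no", "yes", "no"]  # last 5 not approved

print(approved_count(answers, approved_key))
print(probably_lying(answers, approved_key))
```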

To measure persistence, the subjects were given different tasks appealing 
to a variety of interests and were permitted to work as long as they wished. 
Persistence scores were based upon the time the subject spent on the prob- 
lem before voluntarily stopping. The tasks included a variety of mechanical 
and paper-and-pencil puzzles, as well as a story completion test. In adminis- 
tering the story completion test, the examiner read three stories involving 
danger or adventure. Each was stopped at a point of suspense. The subjects 
were then given sheets on which the ending of the story was printed in a
form that made reading difficult. Moreover, as the end of the story was
approached, the difficulty of deciphering the words was further increased by
the spacing of the letters and by the indiscriminate mixture of capitals and
small letters. The three successive levels of difficulty are illustrated below
(44, p. 294):


CHARLESLIFTEDLUCILLETOHISBACK"PUTYOURARMSTIGHTAROUNDMYNECKANDHOLDON

NoWhoWTogETBaCkONthETREStle.HoWTOBRingTHaTTerrlflEDBURDeNOFAChILDuPtOSafeTY

fiN ALly tAp-tAPC AME ARHYTH Month eBriD GeruNNing fee Tfee Tcom INGtow ArdT Hem


The subjects were instructed to draw lines separating the words as they read.
The examiner could thus determine how far each individual progressed
before giving up, as well as the time devoted to the task. It may be noted
parenthetically that, regardless of how this test is scored, the resulting
measure is likely to depend upon a number of aptitude variables, quite apart
from the subject's persistence in the given task.

The inhibition tests required the subjects to resist the temptation to turn
to more attractive stimuli than were provided by the task at hand. For
example, an arithmetic test was administered to the same subjects under
normal and distracting conditions. For the latter purpose, the problems were
presented on a sheet covered with cartoon-like sketches, verbal comments,
and other miscellaneous doodles. The larger the difference between the two
scores, the less successful the subject had been in resisting the "pull" of the
distractions.

Most of the CEI tests proved to have good discriminative power, yielding
a wide range of individual differences in scores. Reliability also appeared to
be fairly satisfactory. Although it was not feasible to obtain estimates of re- 
liability for all tests, most had retest or alternate-form reliabilities in the 
.70’s and .80's. Validity was more difficult to establish. Correlations with 
ratings by teachers and classmates were generally low, but such ratings were 
themselves quite unreliable and of doubtful validity. In evaluating what to- 
day would be called the construct validity of their instruments, Hartshorne 
and May found intercorrelations of the many different tests within each cat- 
egory. These correlations proved to be low, indicating considerable specificity 
in the behavior surveyed by the tests. Although tests utilizing similar tech- 
niques correlated highly with each other, when all honesty tests were con- 
sidered, their average intercorrelation was only .227. Moreover, several of the 
individual correlations between honesty tests were practically zero. Similar 
results were obtained when tests within the persistence, inhibition, and service
areas were intercorrelated.
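
The average intercorrelation cited above is the mean of the correlations
between every pair of tests in a category. The following sketch shows the
computation with plain Python; the three "tests" and their scores are invented
for illustration and are not CEI data.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_intercorrelation(tests):
    """Average r over all distinct pairs of tests (dict: name -> scores)."""
    pairs = list(combinations(tests.values(), 2))
    return sum(pearson_r(x, y) for x, y in pairs) / len(pairs)


# Hypothetical scores of four children on three tests.
tests = {
    "copying": [1, 2, 3, 4],
    "peeking": [2, 4, 6, 8],
    "coins": [4, 3, 2, 1],
}
print(round(mean_intercorrelation(tests), 3))
```

A low average of this kind, alongside high correlations between tests sharing
a technique, is what the text describes as specificity of the behavior.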


It should be noted that the variables investigated in the Character Educa- 
tion Inquiry represent categories derived primarily from the ethical and 
social evaluation of behavior. In so far as can be determined from available 
data, however, these categories do not correspond to "traits" in the sense in 
which this concept is used in the factorial analysis of behavioral organization. 
For example, the relative strength of different interests, values, or motives 
for a given individual may determine his persistence or his tendency to cheat 
in specific situations. The child who is motivated to excel in school work is
not necessarily concerned about his achievement in athletic contests or social 
games. And the child who lies to win approbation may not be at all inclined 
to pilfer coins. When first published, the findings of the CEI research aroused 
a storm of controversy. From the standpoint of test construction, however,
this project not only contributed many ingenious techniques but also
demonstrated that standardized procedures and objective, quantitative scoring
could be used successfully with real-life situations.


Situational Stress Tests. Like the tests developed in the CEI program,
situational stress tests are realistic and disguised. The principal difference in
the case of stress tests is to be found in the introduction of features designed
to arouse anxiety or to produce other emotional disturbances. Among the
stimuli that have been employed to induce emotional stress may be
mentioned: electric shock; expectation of shock or pain that does not
materialize; falling and disruption of body balance; startle-producing stimuli,
such as loud noise, sudden explosion, water spray, or air blast; threat of
apparent physical danger; distractions; criticism and razzing; time pressure in
the performance of an assigned task; failure or threat of failure; and a variety
of interpersonal conflicts involving examiners, observers, co-workers, or
helpers. A number of
these conditions derive their stressful nature from the fact that the subject is 
highly motivated to succeed in the task at hand or to make a favorable im- 
pression upon his observers. 

The subject's reactions to stress may be determined through the use of
physiological measures of emotion, objective records of performance, or 
qualitative observations and rating procedures. In many cases, a combination 
of these measures may yield the most satisfactory information. Whenever 
feasible, it is also desirable to make observations during a control, or pre- 
stress period, as well as during a post-stress period (47). The control obser- 
vations provide a measure of normal performance for each individual, in 
terms of which his behavior under stress may be evaluated. The post-stress 
period permits an appraisal of the individual's recovery from stress. A
post-stress interview is likewise helpful in obtaining supplementary
information regarding the subject's emotional involvement, his interpretation
of the situation, and other questions bearing on the intensity of stress
actually experienced by each individual.

One type of situational stress test is illustrated by the Observational Stress
Test developed during World War II as part of a test battery for the selection
of Air Force pilots (41, pp. 660-663; 61, pp. 811-814). This test was designed
to measure the subject's ability to resist confusion and distraction in
performing a complex coordination task similar to that of piloting a plane.
The apparatus provided seven controls (a pedal, a stick, and five levers),
which were to be reset continually by the subject in response to signal lights
and a buzzer. Electric clocks recorded the speed with which correct settings
were made throughout the test period. In addition to the confusing nature of
the complex and constantly changing task, executed under time pressure,
the examiner introduced criticism and threat of failure through such
standardized comments as the following:

Set the controls . . . You must work more quickly . . . Your scores are not
nearly good enough yet. Remember we are rating you the same way a primary
instructor would rate you on your flying . . . You will have to do things exactly
right or you are through . . . Are you letting a simple test like this confuse
you? . . .

Among the most ambitious projects concerned with the development and
use of situational stress tests was that undertaken by the Office of Strategic
Services during World War II (63, 66). The object of this testing program
was the evaluation of candidates for assignment to military intelligence.
The principal assessment program consisted of a three-day session of
intensive testing and observation. During this period, the candidates lived
together
in small groups, under almost continuous scrutiny by members of the as- 
sessment staff. Besides specially constructed situational tests, the program 
included aptitude tests, projective techniques, intensive interviewing, and
general observations under casual and informal conditions. Several of the
situational tests were modeled after techniques developed in the German and
British armies. 

An example of an OSS situational stress test is provided by the Construction
Test, in which a five-foot cube had to be assembled from wooden poles,
blocks, and pegs. The subject was informed that, since it was impossible for
one man to complete the task within the ten minutes allotted to it, he would
be given two helpers. Actually, the helpers were psychologists who played 
prearranged roles. One followed a policy of inertia and passive resistance; 
the other obstructed the work by making impractical suggestions, asking ir- 
relevant and often embarrassing questions, and needling the candidate with 
ridicule and criticism. So well did the helpers succeed in frustrating the candi- 
dates that the construction was never completed in the history of the assess- 
ment program. The subject's emotional and interpersonal reactions in this 
situation were observed and evaluated qualitatively. 

Available validity data on situational stress tests are meager (41, 52, 60,
66). A major difficulty arises from the lack of suitable criterion data. In the 
OSS project, for example, the diversity of individual assignments reduced 
the possibility of evaluating field performance under comparable conditions 
for different members of the group. Unreliability of criterion ratings also 
tends to reduce validity estimates. With these qualifications, it must be noted 
that, when determined, predictive validity of situational stress tests has 
proved to be low. Although a few techniques may be promising for special 
purposes, the contribution of these tests rarely justifies the time, equipment, 
and highly trained personnel that their administration requires. 


"Leaderless Group" and Related Techniques. A relatively common type
of situational test utilizes a "leaderless group" as a device for appraising
such characteristics as cooperation, teamwork, resourcefulness, initiative,
and leadership. In such tests, a task is assigned that requires the cooperative
efforts of a group of examinees, none of whom is designated as leader or
given specific responsibilities. Examples from the OSS program include the
Brook Situation, involving the transfer of personnel and equipment across a
brook with maximum speed and safety; and the Wall Situation, in which
men and materials had to be conveyed over a double wall separated by an
imaginary canyon.


A promising variant of this technique is the Leaderless Group Discussion
(LGD). Requiring a minimum of equipment and time, this technique has
been used widely in the selection of such groups as military officers, civil
service supervisors and administrators, industrial executives and management
trainees, sales trainees, teachers, and social workers (cf. 8). It has also
been employed in research on leadership (10, 20), on the effects of counseling
(68), and on the selection of clinical psychology trainees (53). Essentially,
the group is assigned a topic for discussion during a specified period.
Examiners observe and rate each person's performance, but do not participate
in the discussion. Although often used under informal and unstandardized
conditions, the LGD has been subjected to considerable research
(8). Bass (10) has worked out objective indices of each individual's leadership
in terms of his effectiveness in changing group opinion.

Validity studies suggest that LGD techniques are among the most effective 
applications of situational tests. Many significant and sizeable correlations 
have been found between ratings from LGD performance and follow-up or 
concurrent ratings obtained in military, industrial, and social settings (cf.
8; 40, p. 262). Some of these correlations are as high as .60. It is also
interesting to note that leadership ratings based on a one-hour LGD correlated
over .60 with leadership assessment from three days of situational testing in
the OSS program (66). A similar correlation between LGD and an entire
battery of situational tests was found by Vernon (88) in a British study.

Information relevant to the construct validity of the LGD is provided by
correlations with certain personality tests, such as the Ascendance and
Sociability scores of the Guilford-Zimmerman Temperament Survey, both of
which yield significant positive correlations with LGD ratings. Factor
analyses of performance in leaderless group situations have yielded three
factors: (a) "individual prominence": efforts to stand out from others and
individually achieve various personal goals; (b) "group goal facilitation":
efforts to assist the group in achieving goals toward which the group is
oriented; and (c) "group sociability": efforts to establish and maintain cordial
and socially satisfying relations with other group members.

Neither LGD nor other, more elaborate situational tests, however, have
proved valid as devices for assessing broad personality traits (52, 53, 60,
66). All such tests appear to be most effective when they approximate actual
"worksamples" of the criterion behavior they are designed to predict. The
LGD tests in particular have some validity in predicting performance in jobs
requiring a certain amount of verbal communication, verbal problem-solving,
and acceptance by peers. Another factor that seems to increase the
predictive validity of situational tests is job familiarity on the part of the
raters (46, 88). Such a finding again suggests that situational tests work
best when the subject's performance is interpreted as a worksample rather
than in terms of underlying personality variables. Flanagan (3, 32) has de- 
veloped a number of situational techniques that are closely modeled after 
actual job activities and are scored in terms of a checklist of desirable and 
undesirable actions. A typical problem used in evaluating naval officers
requires the examinee to discuss with a junior officer an unsatisfactory
performance report he is about to submit (3, p. 2). The circumstances leading
up to the unsatisfactory report are described and the examinee is then
observed in his handling of the interview.


PERCEPTUAL, COGNITIVE, AND EVALUATIVE TASKS 


A major trend in personality testing today is the development of a wide 
variety of simple and comparatively objective tests, most of which are of 
the paper-and-pencil type. In common with the previously discussed situa- 
tional tests, these techniques can be characterized as relatively structured 
and disguised. Rather than attempting to utilize complex, lifelike, or realistic 
situations, however, these tests present the subject with an artificial task 
bearing little or no resemblance to the criterion to be predicted. The tests 
under consideration represent efforts to identify behavior that may serve as 
a valid predictor of a criterion, without being a direct sample of criterion
behavior. For this reason, these techniques have sometimes been described as
"indirect" tests.


These tests have also been characterized by some writers as "objective
tests." Such a designation, however, does not adequately differentiate them
from other types of personality tests, since all psychological tests are objective
when properly constructed, administered, and evaluated. In earlier chapters
we have seen, for example, how personality inventories and projective
techniques can be used as objective tests, although frequently they are not so
employed. When applied to personality tests, moreover, the term "objective"
has been given widely varied meanings by different writers (cf. 89).
Campbell (19) uses the term in a very special sense to indicate that the subject
perceives the task as having an objectively correct solution. This definition
would differentiate most tests in this section from both self-report and
projective tests. Cattell (21, 24) applies the term to all except self-appraisal
techniques. Still other writers base their criterion of objectivity in personality
assessment on interscorer consistency and the elimination of examiner
variables.


Although there is as yet no satisfactory or generally accepted label for the
type of tests to be considered in this section, we can readily identify their
common differentiating features. First, the subject is task-oriented, rather
than being report-oriented as in personality questionnaires. He is given an
objective task to perform, rather than being asked to describe his habitual
behavior. Second, the purpose of these tests is disguised, the subject not
realizing which aspects of his performance are to be scored. Third, the tasks
set for the subject are structured. In this feature lies their principal difference
from the tasks utilized in projective techniques. To be sure, structuring is a
matter of degree. As a group, however, the tests now under consideration
are more highly structured than are typical projective devices.

A fourth and related feature pertains to the apparent existence of a "right
solution" for each task or problem, at least from the subject's viewpoint.
Thus many of the tests are presented in the guise of aptitude measures, in
which the subject endeavors to give "correct" answers. True, the instruments
are not actually treated as aptitude tests, nor are they scored on the basis of
right and wrong responses. But the subject's approach to the test is
nevertheless quite unlike that encouraged by projective tests, in which
"anything goes." Finally, the rationale underlying the construction of such
tests is often based upon the concept of personality style, or broad stylistic
traits of behavior, which may be manifested in a wide variety of dissimilar
activities or "media." In this respect, the approach is similar to that of
projective techniques.

Major research projects that include the development of a number of
personality tests in the present category have been conducted under the
direction of Thurstone (84, 87), MacKinnon (59), Cattell (21, 24), Eysenck
(31), and others. In addition to these comprehensive and continuing projects,
other investigators have been engaged in developing single tests of the
same nature for specific purposes. It might be added that some tests discussed
in earlier chapters could also be included in the present category.
This is true, for example, of the various sorting and perceptual tests
described in Chapter 12, the Porteus Mazes cited earlier, and
the more highly standardized projective techniques.

Perceptual Functions. One of the principal sources of the
personality tests under consideration is to be found in the area of perceptual
functions. A rapidly growing body of experimental literature has demonstrated
significant relationships between the individual's attitudinal,
motivational, or emotional characteristics and his performance on perceptual or
cognitive tasks (cf., e.g., 14, 17, 49, 91). It will be recalled that
a number of projective techniques, notably the Rorschach, are essentially
perceptual tests.

Of the factors identified in factorial analyses of perception, two that have
proved particularly fruitful in personality research are speed of closure and
flexibility of closure (67, 82). The first involves the rapid recognition of a
familiar word, object, or other figure in a relatively unorganized or mutilated
visual field. Typical items from a test found to be highly saturated with this
factor (Street Gestalt Completion) are reproduced in Figure 121. Flexibility
of closure requires the identification of a figure amid distracting and
confusing details. Two items from a test with a high loading in this factor
(Gottschaldt Figures) are also shown in Figure 121. Several studies have
reported suggestive data indicating possible relationships between each of
these factors and personality traits (59, 67, 82). In one investigation (67), for
example, persons who excelled in speed of closure tended to rate themselves as
sociable, quick in reactions, artistic, self-confident, systematic, neat and
precise, and disliking logical and theoretical problems. In contrast, those
scoring high in flexibility of closure had high self-ratings on such traits as
socially retiring, independent of the opinions of others, analytical,
interested in theoretical and scientific problems, and disliking rigid
systematization and routine.

Fig. 121. Sample Items Illustrating Perceptual Tests. (From Thurstone, 83, p. 7)
Speed of Closure (Street Gestalt Completion): What does each picture represent?
Flexibility of Closure (Gottschaldt Figures): Which of the drawings at the
right contain the design at the left?

In other tests, the subject is presented with conflicting perceptual cues. 
For example, in the Stroop Color Word Test (82), the subject first reads 
printed color names from a card as quickly as possible; on a second card, 
he names the colors of rows of dots as fast as he can. The third card contains 
color names printed in colors that do not correspond to the names. For
example, the word "blue" might be printed in red. The subject is required to
name each color as rapidly as possible while ignoring the words. Finally, he
is told to read each word while ignoring the colors. Increase in time required
from trial 1 to 4 (reading words) and from trial 2 to 3 (naming colors) shows 
the extent of blocking caused by the conflicting cues. 
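
The two comparisons named above reduce to simple differences in time
between trials. The sketch below assumes times recorded in seconds for the
four trials in the order described; the function name and the sample times are
hypothetical.

```python
def stroop_interference(t1, t2, t3, t4):
    """Blocking scores from the four trial times.

    t1: reading color names printed in black
    t2: naming the colors of dots
    t3: naming ink colors while ignoring the conflicting words
    t4: reading the words while ignoring the conflicting ink colors
    """
    return {
        "word_reading": t4 - t1,  # increase from trial 1 to trial 4
        "color_naming": t3 - t2,  # increase from trial 2 to trial 3
    }


print(stroop_interference(40, 60, 110, 45))
```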

Several tests have been designed to provide objective indices of color or
form dominance. It will be recalled that this is one of the response
determinants emphasized in traditional Rorschach scoring. Although there is
evidence of a slight tendency for individuals to favor either form or color in
their perceptions, intercorrelations of different form-color tests are low (51).
One such test, developed by Thurstone (86, 87), presents a motion-picture film
in which small colored figures move in a straight line across a clock face. If
the individual follows constant color in spite of flickering shapes, he will
report movement in one direction; if he follows constant shape in spite of
flickering colors, he will report movement in the opposite direction.

Conflicting perceptual cues also underlie the tests devised by Witkin and
his associates (91) in a 10-year study of perceptual space orientation.
Through various tests utilizing a rod and frame that could be independently
moved, a tilting chair, and a tilting room, these investigators were able to
show that individuals differ widely in their "field dependence," or the
extent to which their perception of the upright is influenced by the
surrounding visual field. Considerable evidence was amassed to indicate that
this perceptual trait is a relatively stable, consistent characteristic,
having a certain amount of generality. Thus both odd-even and retest
reliabilities were high, and most of the intercorrelations among the
different spatial orientation tests were significant. Of even more interest
are the significant correlations between the orientation tests and the
Embedded-Figure Test (similar to the Gottschaldt Figures illustrated in
Figure 121), which may be regarded as measuring field dependence in a purely
visual, paper-and-pencil situation. The authors also report some suggestive
relationships between field dependence scores and personality
characteristics, but these relationships need to be checked under more
carefully controlled conditions.

Controlled Verbal Association.


Another type of personality test is based on controlled verbal associations.
Such tests are similar to the [...] discussed in the preceding chapter, the
principal [...] restrictive nature of the tasks assigned in [...]. Several
tests of this type were devised for use in the previously cited Thurstone
Project (84, 85). One example is the Verbal Emphasis Test, designed to
measure differences in the speed with which the subject can make cognitive
and affective discriminations in the meanings of words. Sixty word pairs are
projected serially on a screen, the subject being required to indicate, by
pressing the appropriate key, which of the two words is the stronger. One
half of the pairs require "cognitive" discriminations, as in
"colossal-large." The other half call for affective discriminations between
either positively or negatively toned words. The former is illustrated by
"interested-enthusiastic," the latter by "miserable-unhappy." It is thus
possible to compare median reaction time for cognitive and affective
discriminations, as well as for positively and negatively toned word-pairs
within the latter category.


A second illustration is provided by a Synonyms-Antonyms Test. There are two
parts to this test, in each of which adjectives are exposed on a screen, one
at a time. In the first part, the subject is asked to give an opposite, or
antonym, for each word. In the second part, administered on another day, he
gives a synonym. The adjectives in each list are of three types: affectively
positive, stating complimentary facts about people; affectively negative,
describing uncomplimentary [...]ties; and affectively neutral, referring
[...] damp, or legible. Three scores can be [...] hypotheses regarding
personality indicators. [...] can be compared on: (a) synonyms [...]
affectively toned and neutral stimuli, and (c) complimentary and
uncomplimentary adjectives.

An ingenious technique is employed in a [...] Recognition Test. The subject
is provided with a stack of cards, on each of which is printed a word [...]
is instructed to read each word aloud as soon as [...] randomly in the list
are 50 homonyms, which can be pronounced as either verbs or non-verbs
(usually nouns). An example is "object," which can be accented on either
syllable. The hypothesis underlying this test is that persons giving
predominantly verb-responses are more active than those who give a majority
of noun-responses.

Aesthetic Preferences. Still another approach utilizes aesthetic preferences 
as indices of personality factors. As a part of the previously mentioned 
MacKinnon project, Barron (6) and Barron and Welsh (7) asked subjects 
to indicate whether they liked or disliked each of a set of drawings and small 
colored reproductions of paintings. The results showed certain
correspondences between the type of drawing and the type of painting
preferred by individual subjects. With regard to personality, subjects
showing a preference for simple-symmetrical drawings and
realistic-traditional paintings described themselves significantly more
often as contented, gentle, conservative, unaffected, patient, and peaceable.
Those who liked complex-asymmetrical figures and "modern" paintings, on the
other hand, chose such self-descriptions as gloomy, loud, unstable, bitter,
cool, dissatisfied, pessimistic, emotional, irritable, and pleasure-seeking.
Such findings are, of course, tentative until confirmed by cross-validation.
But the nature of the particular adjectives chosen significantly more often
by each group suggests that there may be more than a chance association
between the subject's self-perception and his aesthetic preferences.
In the effort to demonstrate the unimportance of specific test content in
personality measurement, Berg (12) prepared a Perceptual Reaction Test
consisting of 60 abstract designs. To each design, the subject responds by
checking "Like Much," "Like Slightly," "Dislike Slightly," or "Dislike
Much." Despite the simplicity and abstract nature of its content, the test
apparently evokes response sets that are correlated with other behavior
manifestations. Through an analysis of deviant responses on this test, for
example, it proved possible to construct scales for differentiating several
common psychiatric disorders.

Another attempt to utilize aesthetic responses in the development of
personality tests is to be found in the IPAT Music Preference Test of
Personality (25, 26, 28). This test consists of 100 short piano selections
reproduced on two sides of a phonograph record. For each musical excerpt, the
subject records "Like," "Indifferent," or "Dislike." On the basis of factor
analysis, the 100 items were classified into 11 groups, each yielding a
separate factor score. Only 7 of these factor scores are interpreted,
however, owing to the very low odd-even reliability of the remaining four.
Evidence for the validity of the IPAT Music Preference Test of Personality,
as well as for the psychological interpretations of the music preference
factors, was derived in part from correlations with Cattell's Sixteen
Personality Factor Questionnaire (cf. Ch. 18). Additional evidence was based
upon certain relations discovered with psychiatric syndromes. In an attempt
to express the essential common characteristics of the items classified
under each factor, musicians were asked to analyze the selections within the
various categories. The type of results obtained may be illustrated with
Factor 1. According to the musicians' descriptions, a high score on this
factor indicates liking for music which builds "[...] relaxed mood through
resolved har[...] and controlled expression." For the [...] suggested that a
high score is [...] freedom from acute neurotic symptoms [...] peace of
mind, trustfulness, and free[...] on the other hand, appeared to be related
to overwroughtness, exhaustion, [...] bodily symptoms of emotionality,
resentment, and suspiciousness. This factor further distinguished between
normals and psychotics at the .01 level of significance, the difference
being largest in the case [...]

Humor. [...] likewise been explored as possible in[...] IPAT Humor Test of
Personality [...] jokes or cartoons to be evaluated [...] each pair. In [...]
for each joke or cartoon. [...]

    (a) Epitaph to a waiter:
            By and by
            God caught his eye.

    (b) One prehistoric man to another:
            "Now that we've learned to communicate with each other - shut up!"

Although Form A yields more reliable scores, the arrangement followed in
Forms B and C provides additio[...] the subject's general readiness to rate
[...] again grouped into clusters [...] analysis, a separate score being
found in each factor.


Evaluation of Proverbs. Some personality tests elicit subjects' reactions to
[...] For example, in the "Famous Sayings" test developed [...] responds to
each of 130 statements by indicating whether he agrees, disagrees, or is
uncertain. Scales identified as Hostility, [...] Mores were derived through
factor analysis. [...] is also found on the basis of the subject's general
acceptance or rejection of items. Apart from the factorial analyses and a
few significant correlations with certain personality inventory scores,
validity data for this test were based largely on significant differences
between various occupational, regional, educational, and clinical groups.

Interests and Attitudes. Finally, mention may be made of the use of relatively 
objective, task-oriented tests in the appraisal of interests and attitudes. Some 
of the earliest attempts to measure interests centered around the use of tests 
of information, learning, and distraction (cf. 34). It is certainly reasonable
to expect that an individual will more readily learn and retain information
related to his interests, and that material appealing to his interests will
prove to be more distracting to him than material in which he is not inter- 
ested. These early tests, however, did not prove as successful as interest in- 
ventories such as the VIB (cf. Ch. 19), and hence were soon abandoned. 
More recently, efforts to construct "indirect" tests of interest have been 
renewed (cf. 18, 22, 79, 90). 

A number of information tests have been developed for research on attitudes.
The knowledge an individual has acquired is apt to reflect his selective
perception and retention of facts, as well as his biased sources of
information. We are likely to notice and to remember those facts that are in
line with our expectations or hypotheses, and to overlook and forget others.
Moreover, when no information is available on a given question, the
direction of guessing may be determined by the subject's attitude; this
tendency has been utilized in the error-choice technique (42), in which the
respondent is forced to choose between two equally incorrect alternatives
reflecting opposed biases. The items in error-choice tests are such that the
correct answers are not generally familiar to the respondents, and hence the
errors are not readily apparent. Attitude bias on the part of the subject is
revealed by systematic errors in one direction, as opposed to random errors.
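The distinction between systematic and random errors can be made concrete
with a short sketch. The example below is only an illustration and is not the
scoring procedure of the cited study: the "pro"/"anti" response labels are
hypothetical, and the two-sided binomial (sign) test is an assumption
introduced here as one plausible way of asking whether errors lean in one
direction more often than chance guessing would produce.

```python
from math import comb


def error_choice_bias(responses):
    """Summarize directional bias in error-choice responses.

    `responses` holds one label per item, "pro" or "anti", naming the
    direction of the (equally incorrect) alternative the subject chose.
    These labels are hypothetical and used only for illustration.
    Returns (bias_score, p_value): the excess of "pro" over "anti" errors,
    and a two-sided binomial probability of so lopsided a split if the
    subject were guessing at random (p = 0.5 per item).
    """
    n = len(responses)
    pro = sum(1 for r in responses if r == "pro")
    anti = n - pro
    bias_score = pro - anti

    # Probability of a split at least this extreme in either direction.
    k = max(pro, anti)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return bias_score, p_value


if __name__ == "__main__":
    one_sided = ["pro"] * 10
    balanced = ["pro"] * 5 + ["anti"] * 5
    print(error_choice_bias(one_sided))
    print(error_choice_bias(balanced))
```

A subject whose errors all fall in one direction yields a large bias score
and a small probability under random guessing; an evenly split record yields
a score of zero.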
Among the other techniques utilized for the measurement of attitudes may be
mentioned perception and memory tests, in which distortions and errors
reflect bias; the evaluation of arguments, syllogistic conclusions,
inferences, and the like; prediction of the outcomes of described events;
estimation of group opinions, the estimated opinions being presumably
colored by the subject's own views; judging character from the photographs
of persons identified as members of different minority groups, occupations,
etc.; expressing approval or disapproval of pictured or described incidents
involving [...] relations; and rating jokes, some of which pertain to
minority-group members.

Still other procedures are illustrated by an investigation of the
development of racial attitudes among children from the kindergarten through
the eighth grade (48). In one test constructed for this study, a set of 12
photographs of pleasant-looking Negro and white boys was shown to the [...]
with the instructions to "Pick out the one you like best, next best, next
best, and so on until all are ranked." The score was based upon
discrepancies in the ranks assigned to the Negro and white boys. The same 12
photographs were employed in another test in which the subject was asked to
select photographs according to instructions such as the following: "Show me
all those you want to sit next to on a street car"; "Show me all those that
you would [...] swimming with"; and "Show me all those that you'd like to
have for a cousin."


The last-mentioned test represents a pictorial and somewhat disguised
adaptation of an earlier technique utilized in the Bogardus Social Distance
Scale (15). In the Bogardus test, which is more closely related to an
attitude inventory, the subject is given a list of national, racial,
socioeconomic, religious, and other special groups, with the instructions to
mark each of seven given types of relationships to which he would be willing
to admit members of the group. These relationships range from "close kinship
by marriage" [...] relationships checked by the subject [...] the "social
distance" of each group [...]

[...] novel and subtle ways of assessing [...] personality traits. A number
of the proposed techniques, moreover, reflect a healt[...]


SOCIOMETRY 


Sociometry is essentially a procedure for recording interpersonal attractions
among the members of a group. The technique has been highly developed by
Moreno (62), who applied it to a wide variety of groups and problems.
Ordinarily it is used with a group of persons who have been together long
enough to be acquainted with one another, as in a class, factory,
institution, club, or military unit. Each individual is asked to choose one
or more group members with whom he would like to study, work, eat lunch,
play, or carry out any other designated function. Subjects may be asked to
nominate as many group members as they wish, or a specified number (such as
first, second, and third choice), or only one person for each function.

Sociometric data may be analyzed in various ways. A favorite procedure is to
prepare a sociogram (cf. 50, 62), as illustrated in Figure 122. This diagram
shows the expressed preferences of a hypothetical group of eight girls, when
each was allowed two choices. In sociometric lingo, Claire is a "star,"
having been chosen by four of the eight girls. Jean is an "isolate" who has
neither made nor received any choices. Although Helen and Nancy both chose a
preferred partner, they too received no choices. Some writers class[...]
isolates along with Jean; others reserve the term "unchosen" for this
category (55). The sociogram also serves to reveal the presence of cliques
of various sizes. In Figure 122, Claire, Ruth, and Judy form a closely knit
triangle through their mutual choices. Debbie and Mary constitute a mutual
pair, but Mary is also an intermediary between this pair and the previously
mentioned triangle. With larger groups, the socio[...] of group structure.

[Fig. 122. A sociogram: girls in a school club are asked to choose a partner
with whom they w[...] on a specific project; each is allowed two choices.]
[...] put to many uses both in research and in the practical [...] groups.
The sociogram [...] may serve as a basis for assigning individuals to [...]
they will function congenially. Or it may suggest ways for improving the
cohesiveness and effectiveness of the total group. Thus a particular group
may have many isolates; it may be torn apart by strong cliques; or it may
exhibit other [...] interfere with its unified functioning. Sociograms may be
obtained on different occasions to determine the effects of intervening
factors upon group structure. They may also be utilized in studying
attitudes toward minority group members within a group.


From the standpoint of individual differences, sociometric data can help in
identifying isolates as well as [...]. In addition, several indices can be
[...] of each individual (55, 62). Such [...] of times an individual has
[...] a specific function or for all [...] permitted, the number of first
[...]. When number of choices is unlimited, [...] "expansiveness" may be
found from the number of other persons the individual chooses.
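The tallies that a sociogram displays graphically can also be computed
directly from the choice records. The sketch below is illustrative only: the
choice data are hypothetical, invented here to be loosely consistent with the
eight-girl example described for Figure 122 (the actual arrows in the figure
are not recoverable from the text), and the function and variable names are
assumptions rather than established sociometric notation.

```python
from collections import Counter

# Hypothetical choice records: each member maps to the members she chose.
# Invented for illustration; not transcribed from Figure 122.
CHOICES = {
    "Claire": ["Ruth", "Judy"],
    "Ruth": ["Claire", "Judy"],
    "Judy": ["Claire", "Ruth"],
    "Debbie": ["Mary"],
    "Mary": ["Debbie", "Claire"],
    "Helen": ["Claire"],
    "Nancy": ["Ruth"],
    "Jean": [],
}


def sociometric_summary(choices):
    """Return simple indices derived from sociometric nominations."""
    members = list(choices)
    received = Counter(c for picks in choices.values() for c in picks)

    # Number of times each member was chosen by others.
    times_chosen = {m: received.get(m, 0) for m in members}
    # "Expansiveness": number of other persons each member chose.
    expansiveness = {m: len(set(choices[m])) for m in members}

    # Isolates neither make nor receive choices; the "unchosen" make
    # choices but receive none.
    isolates = [m for m in members
                if times_chosen[m] == 0 and expansiveness[m] == 0]
    unchosen = [m for m in members
                if times_chosen[m] == 0 and expansiveness[m] > 0]

    # Mutual pairs: A chose B and B chose A.
    mutual_pairs = sorted({
        tuple(sorted((a, b)))
        for a in members for b in choices[a]
        if a in choices.get(b, [])
    })
    return times_chosen, expansiveness, isolates, unchosen, mutual_pairs


if __name__ == "__main__":
    chosen, expansive, isolates, unchosen, pairs = sociometric_summary(CHOICES)
    print("times chosen:", chosen)
    print("isolates:", isolates)
    print("unchosen:", unchosen)
    print("mutual pairs:", pairs)
```

With these invented data the member chosen most often plays the part of the
"star," and three mutual pairs among the same three members correspond to the
closely knit triangle.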

Sociometric nominations have generally proved to be one of the most
dependable of rating techniques. When checked against a variety of [...]
criteria dependent upon interpersonal relations, such ratings have been
found to have good predictive validity (55). These findings are
understandable when we consider some of the features of sociometry. First,
the number of raters is large [...] all group members. Second, an
individual's peers [...] position to observe his typical behavior [...]
certain interpersonal traits than teacher[...] observers. Third, and probably
most important, is the fact that the opinions of group members, right or
[...]ly determine the nature of the individual's subsequent interactions
with the group. Other comparable groups may be expected to react toward the
individual in a similar fashion. Sociometric [...] have content validity in
the same sense as worksamples.
An ingenious adaptation of sociometric techniques is provided by the
Syracuse Scales of Social Relations, prepared by Gardner and Thompson [...]
procedure and scoring. Designed principally for use in the classroom, the
Syracuse scales are available at three levels: elementary (grades 5-6),
junior high school (grades 7-9), and senior high school (grades 10-12). To
provide a uniform and broad frame of reference, the individual is first
instructed to select five names "from all persons he has ever known" to
represent key points on his scale. Every member of the class is then
assigned a position in this scale with regard to two questions. These
questions were selected from the Murray system of needs (cf. Chs. 18 and 20)
as being especially important at each of the three levels covered by the
scales. Accordingly, the subject is asked to evaluate his classmates as:

    • a possible source of aid when troubled by a personal problem
      (succorance: included at all three levels)
    • someone to help him do something well so people will praise him
      (achievement-recognition: elementary)
    • someone to look up to as an ideal (deference: junior high)
    • a person whose company he would enjoy at a party or recreation
      (play-mirth: senior high)


Scores are based on the scale positions assigned to an individual by all his
classmates. Several measures can be computed to describe both group and
individual characteristics. For the individual, the midscore of ratings
received indicates how his classmates regard him, while the midscore of
ratings given shows how well his classmates as a group satisfy the
particular need for him. An advantage of these scores is that they are
comparable for all needs, individuals, and groups (35). For further
individual evaluation, percentile norms from a large standardization sample
are also provided for each grade. In view of the care exercised in their
development and the favorable nature of available research findings, these
scales appear very promising for a variety of purposes.


SELF CONCEPTS AND PERSONAL CONSTRUCTS 


A number of current approaches to personality assessment have concen- 
trated on the way the individual views himself and others. Such techniques 
reflect the influence of phenomenological psychology, which focuses on how 
events are perceived by the individual (54, 76). The individual's self
description thus becomes of primary importance in its own right, rather than
being regarded as a second-best substitute for other behavioral observations.
Interest also centers on the extent of self acceptance shown by the
individual. Another common feature of all procedures to be considered in this
section is their applicability to idiosyncratic, intensive investigation of
the individual case. For this reason, they are of special interest to the
clinical psychologist. Many of them, [...]


Self Conceptualization. [...] directed toward eliciting a [...] prepared by
Gough (37, 38, 39). [...] of 300 common adjectives, ar[...] previously
mentioned personality assessment program directed by MacKinnon at the
University of California. In one [...] were analyzed on the assumption that
"any s[...] observed in the analysis would reveal important aspects of
self-perception, whether or not the description[...] could be accepted [...]"
(37, p. 1). This quotation typifies the [...] self-concept tests.

[...] List or on similar descriptive inve[...] performance on several of the
perceptual, cognitive, and evaluative tests described in the preceding
section. Illustrative findings from the California project and from
investigations conducted [...] to say that self-report in[...]. That many
psychologists [...] inventories in this light has already been noted in
Chapters 18 and 19.
The interpretation of personality inventory responses in terms of self
conceptualization forms the basis of a provocative hypothesis formulated by
Loevinger (56, 57). Bringing together many disparate findings from her own
[...] proposes a personality trait which she [...]ize oneself, or to "assume
distance [...]" According to Loevinger, it is the manifes[...] personality
inventories that have been described in such terms as: façade, test-taking
defensiveness, response set, social desirability, acquiescence, and personal
style. In common with a number of other psychologists, Loevinger regards
such test-taking attitudes, not as instrumental errors to be ruled out, but
as the major source of valid variance in personality inventories.

On the basis of data from many sources, Loevinger suggests that ability to
form a self concept increases with age, intelligence, education, and
socioeconomic level. At the lowest point, illustrated by the infant, the
individual is incapable of self conceptualization. As the ability develops,
he gradually forms a stereotyped, conventional, and socially acceptable
concept of himself. This stage Loevinger considers to be typical of
adolescence. With increasing maturity, the individual progresses beyond such
a stereotyped concept to a differentiated and realistic self concept. At
this point, he is fully aware of his idiosyncrasies and accepts himself for
what he is. Loevinger maintains that the level of self conceptualization
attained by the individual is a basic determiner of his impulse control,
social attitudes, and other important aspects of personality.


According to Loevinger, many (if not most) persons fail to reach the final
stage of differentiated self concept. In so far as personality inventory
responses are evaluated in terms of normative data, individuals whose self
concepts are at the stereotyped conventional stage receive higher or "better
adjusted" scores. In the course of [...] advance beyond this stage to the
individualized self concept [...] hence may show a decline in scores on
adjustment inventories (57). Such a hypothesis could account for the
apparent failures of personality inventories when used in a clinical
setting. The finding by some investigators (e.g., 41) that, when evaluated in
terms of personality inventory norms, college seniors appear to have poorer
emotional adjustment than college freshmen may have a similar explanation.
Essentially, Loevinger argues that the capacity for self conceptualization
is an important personality trait and that the relation of this trait to
personality inventory scores is not linear but curvilinear.

Several [...] techniques have been based on comparisons between the
individual's [...] and his concepts of "average," "ideal," or other [...]
(13, 69, 70). Or, the individual's "private" self concept [...] with his
[...] self [...], i.e., his most accurate estimate of himself as he [...]
him (16). Many variations of [...] procedures could readily [...] to test
specific hypotheses. Such techniques lend themselves especially well to the
detection of conflicts, [...] discrepancies in the various concepts. Several
have yielded promis[...] although they are still in an experimental stage.

Q Sort. One of the special techniques suitable for investigating self
concepts [...] sort developed by Stephenson (78). In this technique, the
subject [...] set of cards containing statements or trait names which he
must sort into piles ranging from "most characteristic" to "least
characteristic" of himself. The items may come from a standard list, but
more often are designed to fit the individual case. To insure uniform
distribution of ratings, a "forced-normal" distribution is used, the subject
being instructed to place a specified number of cards in each pile. Such a
distribution can be prepared for any size of item sample by reference to a
normal curve table. It should be noted that, like the forced-choice
technique discussed in Chapter 18, the Q sort yields "ipsative" rather than
"normative" data. In other words, the individual tells us which he considers
his strong and which his weak traits, but not how strong he believes himself
to be in comparison with another person or some outside norm.
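How a "forced-normal" distribution might be prepared for a given number of
items can be sketched as follows. This is only one plausible construction,
offered under stated assumptions: the piles are treated as equal-width
intervals on the normal curve between -2.5 and +2.5 standard deviations (a
range chosen here for illustration, with the two end piles absorbing the
tails), and cumulative rounding is used so that the pile sizes add up exactly
to the number of cards. The function name and parameters are hypothetical.

```python
from statistics import NormalDist


def forced_normal_pile_sizes(n_items, n_piles, z_range=2.5):
    """Return the number of cards to place in each of `n_piles` piles.

    The piles are equal-width z-score intervals spanning -z_range..+z_range;
    the outermost piles also take the tails of the curve. Rounding is done
    on the cumulative proportions so the sizes always sum to n_items.
    """
    if n_items < 0 or n_piles < 1:
        raise ValueError("need n_items >= 0 and n_piles >= 1")
    normal = NormalDist()  # standard normal curve, mean 0 and SD 1
    width = 2 * z_range / n_piles
    # Interior boundaries between adjacent piles.
    edges = [-z_range + width * i for i in range(1, n_piles)]
    cumulative = [0] + [round(n_items * normal.cdf(z)) for z in edges]
    cumulative.append(n_items)
    return [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]


if __name__ == "__main__":
    # For example, 100 statements sorted into 9 piles.
    print(forced_normal_pile_sizes(100, 9))
```

Because every subject must follow the same pile sizes, the resulting ratings
are comparable only within the individual, which is the sense in which the
data are ipsative.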

Q sorts have been employed to study a variety of psychological problems.
When applied to an investigation of individual personality, the subject is
often asked to re-sort the same set of items within different frames of
reference. For example, he may sort the items as they apply to himself and to
other persons, such as his father, his mother, or his wife. Similarly, he may
sort the items as they apply to himself in different settings, such as job,
home, or social situations. Q sorts can likewise be obtained for the
individual as he believes he actually is, as he believes others see him, and
as he would like to be. To observe change, Q sorts may be obtained
successively at different stages during psychotherapy. The degree of
similarity or difference among various Q sorts has often been found by
computing correlations among them. When a large number of Q sorts has been
obtained, the intercorrelations have sometimes been factor-analyzed to
identify common elements or group factors through certain sortings.

Q technique represents a systematization of self-rating procedures. It can 
also be employed as a basis for rating others. For example, a clinician or 
interviewer may record his evaluation of an individual by means of a Q 
sort. Although some of the statistical procedures that have been employed in
analyzing Q-sort data are questionable (29), the Q sort provides a useful
rating technique for both psychological research and practice.

The Semantic Differential. This technique was originally developed by
Osgood and his associates (64) as a tool for research on the psychology of
meaning. It was only later that its possibilities for personality assessment
were recognized. The Semantic Differential represents a standardized and
quantified procedure for measuring the connotations of any given concept for
the individual. Each concept is rated on a 7-point graphic scale as being more
closely related to one or the other of a pair of opposites, such as good-bad
or fast-slow. Every concept to be investigated is paired in turn with each
scale, as illustrated below:

    CHILD    valuable  ___:___:___:___:___:___:___  worthless

    HATRED   tense     ___:___:___:___:___:___:___  relaxed

    SEX      strong    ___:___:___:___:___:___:___  weak

    CHILD    large     ___:___:___:___:___:___:___  small


Some of the scales can be applied literally to a particular concept, as in
the rating of Child on the large-small scale. Many of the ratings are
obviously influenced by common metaphorical usage, as when a person is
described as "cold." When a scale appears totally inapplicable to a concept,
the subject would presumably check the middle position.
Although of relatively recent origin, the Semantic Differential has already
been employed in considerable research, which has contributed to its
construct validation. Tests carried out at four age levels from the first
grade to college showed that, with increasing age, subjects tend to agree
more closely with each other in the connotations of common objects (cf. 64,
p. 289). Intercorrelations and factorial analyses of different scales have
revealed three major factors: Evaluative, with high loadings in such scales
as good-bad, valuable-worthless, and clean-dirty; Potency, found in such
scales as strong-weak, large-small, and heavy-light; and Activity,
identified in such scales as active-passive, fast-slow, and sharp-dull. The
evaluative factor is the most conspicuous, accounting for the largest
percentage of total variance.

Responses on the Semantic Differential can be analyzed in several ways.
The overall similarity of any two concepts for an individual or a group can
be determined in terms of their positions on all scales. The connotations of
all concepts rated by an individual can be investigated by computing the
"score" of each concept in the three principal factors described above. An
approximation of these factor scores can be obtained by averaging the ratings
of each concept on those scales having the highest loadings on each factor.
Thus on a scale extending from +3 to -3, a given individual's concept of "My
Brother" might rate -2 in the evaluative factor, 0.1 in potency, and 2.7 in
activity.

The loadings of the three factors in each concept can be more easily
visualized by means of three-dimensional models, as illustrated in Figure
123. The data are taken from an intensive study of a well-known case of
multiple personality, in which the patient shifted back and forth between
contrasted "selves" (80, 81). The one self, designated as "Eve White," was
meek, self-critical, frustrated, and unhappy. The other, "Eve Black," was
irresponsible, self-centered, fun-loving, and mischievous. "Jane," [...]
adjusted personality, emerged in [...]ally seemed to be replacing the other
[...]. A blind analysis of the Semantic Differential patterns of Eve White
and Eve Black led to personality descriptions that agreed remarkably well
with the case reports of the therapists (64, 65).

[Fig. 123. Semantic Differential: Semantic Patterns Obtained in a Case of
Multiple Personality. (From Osgood and Luria [...])]

Figure 123 shows the Semantic Differential patterns of Eve White and Eve
Black as they appeared early in therapy. In these diagrams, "good" is at the
top, "active" at the left, and "weak" toward the reader. The dark circle is
at the zero point for all three dimensions. [...] "Doctor" [...] patterns.
Eve White places "Me" much lower [...] does Eve Black. The latter also shows
less differentiation of concepts along the evaluative dimension. Eve Black's
distorted values are further indicated by her placing Hate and Fraud along
with Peace, Father, Doctor, and herself toward the good end of the scale,
while Child, Spouse, Job, Sex, and Love are classified with Confusion and
Sickness toward the bad end. In Eve White's pattern, the separation of Love
and Sex and the neutral position of Spouse reflect some of her adjustment
difficulties in marriage. Changes occurring in the course of therapy were also 
accompanied by shifts in the position of concepts in subsequent Semantic 
Differential patterns. 
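The approximate factor scores described earlier for the Semantic
Differential, obtained by averaging a concept's ratings on the scales that
load most highly on each factor, can be sketched briefly. The grouping of
scales follows the examples named in the text; the ratings themselves are
hypothetical. The distance function is an assumption introduced here: a
straight-line (Euclidean) distance over all scales is one plausible way of
expressing the "overall similarity" of two concepts, and it is not claimed to
be the exact index used in the cited research.

```python
from math import sqrt

# Scales grouped by factor, using the example scales named in the text.
FACTOR_SCALES = {
    "evaluative": ["good-bad", "valuable-worthless", "clean-dirty"],
    "potency": ["strong-weak", "large-small", "heavy-light"],
    "activity": ["active-passive", "fast-slow", "sharp-dull"],
}


def factor_scores(ratings):
    """Average a concept's ratings within each factor.

    `ratings` maps each scale name to a value from +3 (first-named pole)
    to -3 (second-named pole).
    """
    return {
        factor: sum(ratings[s] for s in scales) / len(scales)
        for factor, scales in FACTOR_SCALES.items()
    }


def profile_distance(ratings_a, ratings_b):
    """Euclidean distance between two concepts across all scales
    (an illustrative similarity index; smaller means more alike)."""
    scales = [s for group in FACTOR_SCALES.values() for s in group]
    return sqrt(sum((ratings_a[s] - ratings_b[s]) ** 2 for s in scales))


if __name__ == "__main__":
    # Hypothetical ratings of two concepts by one individual.
    my_brother = {
        "good-bad": -2, "valuable-worthless": -2, "clean-dirty": -2,
        "strong-weak": 1, "large-small": 0, "heavy-light": -1,
        "active-passive": 3, "fast-slow": 3, "sharp-dull": 2,
    }
    neutral = {scale: 0 for scale in my_brother}
    print(factor_scores(my_brother))
    print(profile_distance(my_brother, neutral))
```

The three averages for a concept give its coordinates in a three-dimensional
model of the kind shown in Figure 123.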

Role Construct Repertory Test. A technique devised specifically as an aid in 
clinical practice is the Role Construct Repertory Test (Rep Test) developed 
by Kelly (54). This test has common features with a number of other
personality tests, notably the Semantic Differential and the various sorting
tests used to study concept formation (Ch. 12). In the Rep Test, however, the
objects to be sorted are persons who are important in the subject's life. And
unlike the Semantic Differential, the Rep Test requires the subject himself
to designate the scales, dimensions, or constructs in terms of which he
characterizes these persons.

The development of the Rep Test is intimately related to Kelly’s
personality theory. A basic proposition in this theory is that the concepts or
constructs an individual uses to perceive objects or events influence his
behavior. In the course of psychotherapy, it is frequently necessary to build new
constructs and to discard some old constructs before progress can be made.

The Rep Test is designed to help the clinician identify some of the client's
important constructs about people. Although the test can be administered in
many ways, including both group and individual versions, one of its simpler
variants will serve to illustrate its essential characteristics. In this variant
the subject is first given a Role Title List and asked to name a person in his
experience who fits each role title. A few examples are:


A teacher you liked

Your father

Your wife or present girl friend

A person with whom you have been closely associated recently who appears to
dislike you
The examiner next selects three of the persons named and asks, “In what
important way are two of them alike but different from the third?” This
procedure is repeated with many other sets of three names, in which some of
the names recur in different combinations. The Rep Test yields a wealth of
qualitative data. A simplified factor-analytic procedure has also been
developed for quantitative identification of constructs that are important for
each individual.
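
The triad procedure described above can be illustrated with a short sketch. It is not Kelly's own administration procedure: the function name `make_triads`, the persons' names, the shortened fourth role title, and the use of random selection are all assumptions made for illustration. The sketch only shows how sets of three names can be drawn so that individual names recur in different combinations.

```python
import itertools
import random


def make_triads(names, n_triads, seed=None):
    """Pick n_triads distinct sets of three names; a name may recur across sets."""
    all_triads = list(itertools.combinations(names, 3))
    if n_triads > len(all_triads):
        raise ValueError("not enough distinct triads for the requested count")
    rng = random.Random(seed)  # seeded only so the illustration is repeatable
    return rng.sample(all_triads, n_triads)


# Hypothetical answers a subject might give to four role titles.
role_figures = {
    "A teacher you liked": "Mr. Adams",
    "Your father": "Father",
    "Your wife or present girl friend": "Helen",
    "A person who appears to dislike you": "Carl",
}

triads = make_triads(list(role_figures.values()), n_triads=3, seed=0)
for a, b, c in triads:
    print(f"In what important way are two of {a}, {b}, and {c} alike "
          "but different from the third?")
```

With four persons there are only four possible triads, so each name necessarily appears in more than one set; a longer Role Title List would allow many more combinations.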
FURTHER PROCEDURES FOR THE APPRAISAL OF
PERSONALITY
The tests considered in this and the preceding three chapters give ample
evidence of the variety of approaches that have been followed in the
measurement of personality variables. Earlier chapters provided a picture of the
diversity of procedures commonly grouped under projective techniques and
under self-report inventories. In this chapter, we have examined several
relatively new types of tests, including situational tests; task-oriented
structured tests utilizing perceptual, cognitive, and evaluative functions;
procedures based on sociometry; and a number of attempts to systematize the
exploration of self concepts and personal constructs.

Nor does this survey by any means exhaust the methods that have been
utilized for the appraisal of personality. Research on possible physical and
physiological indicators of personality characteristics has been going on for
many decades, although so far the results have been largely negative (cf. 4,
Ch. 5; 21, Chs. 7 and 15; 22; 40, Ch. 14). Among the physiological measures
investigated in this connection are muscle tension, basal metabolic rate,
blood pressure, pulse rate, and galvanic skin reaction. Considerable attention
has been directed toward investigating possible relationships between
intellectual or personality variables and critical flicker frequency (CFF), or the
rate at which a flickering light is first seen to fuse. It is believed that CFF may
[...] efficiency of functioning of the nervous system. [...]
measurement of the electrical activity of [...] used in the study of
personality deviations. The relation of body build to personality factors has been
periodically [...] theories of constitutional types, of which the most recent is
[...] (cf. 4, Ch. 6; 74; [...]).

A body of literature dealing with expressive movements has likewise
accumulated (2; 5, Chs. [...]; 11, Chs. 13-15 and 21; 40, Ch. 11; 77; 92).
Expressive movements have been defined as “those aspects of movement which
are distinctive enough to differentiate one individual from another”
(2, p. vii). Such reactions include not only gross bodily movements like
walking, gesturing, and performing various motor tasks, but also handwriting and
speech. Handwriting has received special attention, because it represents a
[...] (5, Ch. 15; 11, Ch. 14; 77). Available data [...]
characteristics of handwriting or other expressive movements and personality
variables, however, are negative or inconclusive.
Many types of personal documents, such as letters, diaries, autobiographies,
and art products, have likewise been subjected to intensive investigation
for possible clues to attitudes, emotional traits, or [...]
personality (cf. 1; 73, pp. 323-330). One attempt to [...]
of verbal self-reports is to be found in the Discomfort-Relief Quotient (DRQ)
proposed by Dollard and Mowrer (30). This index is [...] the number
of words indicating discomfort (unhappiness, [...], suffering) [...] the
number of words indicating relief (satisfaction, comfort, enjoyment). It may also
be found by using sentences or thought units in place of words.
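
A word-count index of this kind can be sketched in a few lines. The sketch assumes the commonly cited form of the quotient, D / (D + R), where D is the number of discomfort words and R the number of relief words. The two word lists and the function name `discomfort_relief_quotient` are hypothetical and are not Dollard and Mowrer's scoring rules, which rest on trained judgment rather than a fixed vocabulary.

```python
import re

# Hypothetical word lists for illustration only; not the published scoring rules.
DISCOMFORT_WORDS = {"unhappy", "miserable", "suffering", "afraid", "worried"}
RELIEF_WORDS = {"satisfied", "comfortable", "enjoy", "relieved", "happy"}


def discomfort_relief_quotient(text):
    """Return D / (D + R) under the assumed formula, or None if nothing is scored."""
    words = re.findall(r"[a-z']+", text.lower())
    discomfort = sum(w in DISCOMFORT_WORDS for w in words)
    relief = sum(w in RELIEF_WORDS for w in words)
    total = discomfort + relief
    return discomfort / total if total else None


sample = "I was worried and unhappy all week, but now I feel relieved."
drq = discomfort_relief_quotient(sample)
print(drq)  # two discomfort words and one relief word
```

Scoring sentences or thought units instead of words would change only the counting step: each unit would be classified as discomfort, relief, or neither before the same ratio is taken.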
Direct observations of behavior represent another source of information
about personality. [...] have made detailed records of the
behavior of [...] in school, boys in camp, [...]
and a number of other natural situations of daily life. Except for their lack
of control over stimulus conditions, such observations do not differ much from
some of the previously discussed situational tests. [...] observations may be
rendered more [...] and accurate through the use of sound recorders, motion
pictures, and other automatic recording devices. To obtain a representative
picture of the individual's behavior in a [...] situation, time sampling may be
employed. This involves a random distribution of observation periods. Depending
upon the nature and purpose of the observations, such periods may vary in
duration from less than [...] to several hours; and they may be concentrated
in one day or [...] over several months. Observations may cover all behavior
occurring during the specified period, or they may be limited to a certain type
of behavior, such as crying or aggressive behavior among school children. The
critical incident technique, in which instances of behavior considered to be
especially favorable or unfavorable for a given purpose are recorded, is a
special example of such selective observation. Thus during a two-month period
the supervisor of a research unit may be asked to keep a record of all instances
of specific actions characteristic of productive and of unproductive research
workers on his staff. This technique, which has been widely applied by Flanagan,
forms the basis of a standardized Performance Record (33) designed for use by
[...] charting the personal and social development of elementary school
children.

Mention should also be made of the time-honored source of information
provided by interviewing techniques (40, Ch. 7). Interviews may vary
from the highly structured (representing little more than an orally
administered questionnaire), through patterned or guided interviews covering
certain predetermined areas, to non-directive [...]
and encourages the subject to talk as freely
as possible. Interviews provide chiefly two kinds of information. First, they
afford an opportunity for direct observation of a rather limited sample of
behavior manifested during the interview situation itself. For example, the
individual’s speech, [...] and manner in meeting a stranger [...].
[...]ant function of interviewing, however, [...]
individual has done in the past is a [...]
future, especially when interpreted [...]
and of the subject's comments [...]


[...]ation procedure. Some of the advantages of peer nominations have already
been mentioned [...] the name of “buddy ratings” [...] “Guess Who” technique,
first used by Hartshorne and May in the Character Education Inquiry. [...]
technique, the children are given [...] under each the [...]

a jolly good fellow, friends with every one, no matter who they are.
This one is always picking on others and annoying them.


Some of these techniques, including direct observations, interviews, and
ratings, are [...] be evaluated as such. Others represent
areas of inquiry out of which [...] or other indicators of personality
characteristics may emerge. It should also be noted that many of the
[...] are concerned not only with personality
traits in the restricted sense, but with all behavior characteristics.


From the bewildering diversity of techniques [...]
formative stage. Few if any available instruments have
as yet proved their value empirically to the same extent as have aptitude or
achievement tests. Consequently, the tester in this field must proceed warily,
at his own risk. Personality testing today offers a real challenge, both to
the creative ingenuity of the test constructor and to the scientific vigilance of
the test user. Even more than in other branches of psychological testing, the
fullest utilization of personality tests requires the ability to recognize
promise, without accepting unsupported claims: to be receptive toward what is
new, without being credulous toward what is unverified.
Scientific aptitude tests, 416-420, 480
Scorer bias, 71
Scorer reliability, 110-111, 118-119, 449,
593
Scoring procedures, 66-71 
Screening, 179 
Seashore Measures of Musical Talents, 
374-375, 409-410 
Seashore-Bennett Stenographic Proficiency 
Test, 470 
Seashore-Hevner Tests for Attitude toward 
Music, 412 
Second-order factor, 343 
Seguin Form Board, 6, 241-242 
Selection, 179 
Selective Service College Qu 
226 
Self-concept tests, 623-626
Self-report inventories, 16-17, 493-523 
Self-scoring answer sheets, 68-70 
Semantic Test of Intelligence, 263-264 
Sensory tests, examples, 368-378 
nique, 580-581
Sentence Completion Test (Rohde), 581
Sequential Tests of Educational Progress
(STEP), 131-132, 447-453, 487
Semantic Differential, 626-629
Short Employment Tests, 396
Sight-Screener, 369
Significance levels, 116-117
Situational stress tests, 608-610 
Situational tests, 17, 605-612 
Situational variables, 65-66 
Sixteen Personality Factor Questionnaire, 
510 
Skewed distribution curves, 26-27 
Snellen Chart, 368-371 
Sociometry, 620-623
Sources of information
Spearman-Brown formula, 121-122
Special aptitude tests, 12-14, 32, 365-420
Spiral omnibus tests, 224
Split-half reliability, 120-122
SRA Achievement Series, 454
SRA Clerical Aptitudes, 396
SRA Junior Inventory, 495 
SRA Mechanical Aptitudes, 393 
SRA Non-Verbal Form, 253 350 
SRA Primary Mental Abilities, a 
SRA Typing Adaptability Test, 47 
SRA Youth Inventory, 495 
Stability coefficient, 118 
Standard deviation, 81
Standard error, of estimate, 159-166
of intra-individual score differences, 132-133
of measurement, 105, 129-131
Standard scores, 90-98 
and achievement test norms, 440 23 
Standardization of testing procedure. 2 
in projective techniques, 591-592 
Stanford Achievement Test, 454 
Stanford-Binet, 11, 189-210 
Stanine, 93, 96-97 
Stealing, tests for detecting, 606 
Stenography tests, 396, 470 
Stenquist Assembly Test, 390 
Story completion tests, 581-582 
Strategy, in decision theory, 162 
Street Gestalt Completion, 614
Stress tests, see Situational stress tests
Stromberg Dexterity Test, 382
530-536
Stroop Color Word Test, 615
Student Opinion, surveys of, 543
Study of Values (Allport-Vernon-Lindzey),
552-555
Superior adults, tests for, 226-233
Syracuse Scales of Social Relations,
Tapping Test, 396
613
Taylor Manifest Anxiety Scale, 505-506
221-223
[...] test, see Stanford-Binet
Terman-Miles M-F test, see Attitude-Interest Analysis Test
[...] Insight into Human Nature,
Test publishers, 638
Tests of General Educational Development,
Thematic Apperception Test,
578
adaptation for Negroes, 577
in personality research, 575-576
Thurstone Interest Schedule, 540
Thurstone Temperament Schedule,
Thurstone Test of Mental Alertness, 23
Thurstone-type attitude scales, 547-551
effect of judges’ attitudes, 550-551
use in industry, 550
Time sampling, 631 
Time-limit tests, see Speed tests 
Tomkins-Horn Picture Arrangement Test
(PAT), 585-586
Toy tests, 589
Trade tests, 468-472
Trait acquaintance, in ratings, 145
True score, estimate of, 130-131
Turse Clerical Aptitudes Test, 396
Turse Shorthand Aptitude Test, 396
Two-Factor theory, 343-344
United States Employment Service,
176, 358-360, 395, 471
Uses of psychological tests, 3-5 


Validity, 29-31, 13
and decision theory, 160-168

dependence
of achievement tests, 460-461
of battery, 174
of projective tests
of situational tests,
Vision, tests of, 368-374
Watch tick test, 375
deterioration index, 322